The great replication crisis

by John Q on September 2, 2015

There’s been a lot of commentary on a recent study by the Replication Project that attempted to replicate 100 published studies in psychology, all of which found statistically significant effects of some kind. The results were pretty dismal. Only about one-third of the replications observed a statistically significant effect, and the average effect size was about half that originally reported.

Unfortunately, most of the discussion of this study I’ve seen, notably in the New York Times, has missed the key point, namely the problem of publication bias. The big problem is that, under standard 20th century procedures, research reports will only be published if the effect observed is “statistically significant”, which, broadly speaking, means that the average value of the observed effect is more than twice as large as the estimated standard error. According to the standard classical hypothesis testing theory, the probability that such an effect will be observed by chance, when in reality there is no effect, is less than 5 per cent.

There are two problems here, traditionally called Type I and Type II error. Classical hypothesis testing focuses on reducing Type I error, the possibility of finding an effect when none exists in reality, to 5 per cent. Unfortunately, when you do lots of tests, you get 5 per cent of a large number. If all the original studies were Type I errors, we’d expect only 5 per cent to survive replication.

In fact, the outcome observed in the Replication Study is entirely consistent with the possibility that all the failed replications are subject to Type II error, that is, failure to demonstrate an effect that is there in reality.

I’m going to illustrate this with a numerical example[^1].

Suppose each of the 100 studies was looking at a treatment (any kind of intervention or change) of some kind, which results in shifting some variable of interest by 0.1 standard deviations (in the context of IQ test scores, for example, this would be a shift of 1.5 IQ points). Suppose the population parameters in the absence of treatment are known, and we have a sample of 225 treatments. We’d expect the sample mean value obtained in this way to be, on average, 0.1 standard deviations higher than the value for the population at large. But the sample mean itself is a random variable, with a standard deviation equal to the population standard deviation divided by sqrt(225) = 15. That is, if we normalize the population distribution to have mean zero and standard deviation 1, the sample mean will have mean 0.1 and standard deviation 0.067. That in turn means that about 30 per cent of the observed samples will have a value greater than twice the standard deviation of the sample mean, which is roughly the level required to find statistical significance.
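The 30 per cent figure can be checked directly from the normal distribution. The following is a minimal sketch using the numbers in the paragraph above; the variable names are my own, and the cutoff of exactly two standard errors is the rule of thumb used in the text rather than the precise 1.96.

```python
from math import sqrt
from statistics import NormalDist

true_effect = 0.1       # shift in population standard deviations (from the text)
n = 225                 # sample size (from the text)
threshold_in_se = 2.0   # "twice the standard error" rule of thumb (from the text)

standard_error = 1 / sqrt(n)               # 1/15 of a population standard deviation
cutoff = threshold_in_se * standard_error  # smallest sample mean counted as significant

# Distribution of the sample mean when the true effect is 0.1
sample_mean = NormalDist(mu=true_effect, sigma=standard_error)

# Probability that a single study clears the significance cutoff
power = 1 - sample_mean.cdf(cutoff)

print(f"standard error = {standard_error:.4f}")
print(f"chance of a significant result = {power:.3f}")
```

The cutoff sits half a standard error above the true mean, so the answer is just the upper tail of a standard normal beyond 0.5.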

Under best practice 20th century procedure, the experimenters would report the effect if it passes the standard test for statistical significance, and dump the experiment otherwise[^2]. The resulting population of reported results will have an average effect size of around 0.2 population standard deviations[^3].
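The average reported effect is the mean of a normal distribution truncated from below at the significance cutoff. Here is a rough sketch of that calculation, under the same assumptions as the example (true effect 0.1, sample of 225, cutoff at two standard errors). It counts only results that are significant in the expected direction, and the names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

true_effect = 0.1
standard_error = 1 / sqrt(225)
cutoff = 2 * standard_error

std_normal = NormalDist()  # standard normal, mean 0 and sd 1

# Distance from the true mean to the cutoff, in standard errors
z = (cutoff - true_effect) / standard_error

# Expected excess of a standard normal over its mean, given that it exceeds z
inverse_mills = std_normal.pdf(z) / (1 - std_normal.cdf(z))

# Mean of the sample mean, conditional on the result being significant
mean_reported = true_effect + standard_error * inverse_mills

print(f"average reported effect = {mean_reported:.3f}")
```

Under these assumptions the result comes out a little under 0.18, which is consistent with "around 0.2".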

Now think about what happens when a study like this is replicated. There’s only a 30 per cent chance that the original finding of statistical significance will be repeated. Moreover, the average effect size will be close to the true effect size, which is half the reported effect size.
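The same point can be illustrated by simulation: generate a large number of original studies, keep only the significant ones, and run one replication of each. This is a sketch under the assumptions of the example above (every study has the same true effect of 0.1 and the same sample size), with a fixed random seed and variable names of my own choosing.

```python
import random
from math import sqrt

random.seed(0)  # fixed seed so the run is repeatable

true_effect = 0.1
standard_error = 1 / sqrt(225)
cutoff = 2 * standard_error
n_studies = 200_000

published = 0          # originals that reached significance
replicated = 0         # published originals whose replication was also significant
sum_original = 0.0     # running total of published effect sizes
sum_replication = 0.0  # running total of replication effect sizes

for _ in range(n_studies):
    original = random.gauss(true_effect, standard_error)
    if original <= cutoff:
        continue  # not significant, so the study is never reported
    published += 1
    sum_original += original

    # The replication is an independent draw from the same distribution
    replication = random.gauss(true_effect, standard_error)
    sum_replication += replication
    if replication > cutoff:
        replicated += 1

replication_rate = replicated / published
mean_original = sum_original / published
mean_replication = sum_replication / published

print(f"share of studies published  = {published / n_studies:.3f}")
print(f"replication success rate    = {replication_rate:.3f}")
print(f"mean published effect       = {mean_original:.3f}")
print(f"mean replication effect     = {mean_replication:.3f}")
```

Because the replication is independent of the original, its chance of success is just the power of a single study, and its average effect is the true effect rather than the inflated published one.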

I don’t think that the results of the replications can be explained this way. At a rough guess, half of the observed failures were probably Type I errors in the original study, and half were Type II errors in the replication.

The broader problem is that the classical approach to hypothesis testing doesn’t have any real theoretical foundations: that is, there is no question to which the proposal “accept H1 if it would be true by chance only 5 per cent of the time, retain H0 otherwise” represents a generally sensible answer. But, we are stuck with it as a social convention, and we need to make it work better.

Replication is one way to improve things. Another, designed to prevent the kind of tweaking pejoratively referred to as ‘data mining’ or ‘data dredging’, is to require researchers to register the statistical model they plan to use before collecting the data. Finally, and this has been the dominant response in practice, we can disregard the “95 per cent” number associated with classical hypothesis testing theory and treat research findings as a kind of Bayesian update on our beliefs about the issue in question. If we have no prior beliefs one way or the other, a rough estimate is that a finding reported with “95 per cent” confidence is about 50 per cent likely to be right. Turning this around, and adding a little more scepticism, we get the provocative presentation of Ioannidis: “most published research results are wrong”.
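To make the idea of a Bayesian update concrete, here is a minimal sketch of Bayes’ rule applied to a significant finding. The function name and the particular priors and power levels are illustrative assumptions, not estimates from any study, and the calculation ignores tweaking and selective reporting, both of which would lower the answer further.

```python
def prob_effect_is_real(prior, power, alpha=0.05):
    """Posterior probability that an effect is real, given a significant result.

    prior: probability, before the study, that the effect exists
    power: chance of a significant result when the effect exists
    alpha: chance of a significant result when it does not
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Illustrative grid of assumptions
for prior in (0.5, 0.25, 0.1):
    for power in (0.8, 0.3):
        posterior = prob_effect_is_real(prior, power)
        print(f"prior={prior:.2f} power={power:.1f} -> posterior={posterior:.2f}")
```

The answer is quite sensitive to both inputs: the lower the prior and the lower the power, the less a single significant result should move us.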

[^1]: Which will probably include an error, since I’m prone to them, but a fixable one, since the underlying argument is valid.

[^2]: In reality, a more common response, especially with nearly-significant results, is to tweak the test until it is passed.

[^3]: I eyeballed this because I was too lazy to look up or calculate the truncated mean for the normal, so I’d appreciate it if a commenter would do my work for me.

{ 174 comments }

1 david 09.02.15 at 6:25 am: Apropos.
2 Hidari 09.02.15 at 7:36 am: From the article above:

‘Bayesian statistics, alone among these first eight, ought to be able to help with this problem. After all, a good Bayesian should be able to say “Well, I got some impressive results, but my prior for psi is very low, so this raises my belief in psi slightly, but raises my belief that the experiments were confounded a lot.”’

See also: http://www.andrews.edu/~rbailey/Chapter%20two/7217331.pdf

Oh and finally: as a piece de resistance: https://en.wikipedia.org/wiki/Quantum_Bayesianism

From what I can tell, all, or almost all of the problems in the social sciences (and, as the Wikipedia article above implies, the hard sciences) stem from their adoption of ‘frequentist’ approaches. The problem is frequentism: the solution is Bayesianism.

Thoughts?

(PS: see also https://en.wikipedia.org/wiki/Instrumentalism)

(Please note that both of these ideas are either implicit or explicit in the OP so I hope I’m not dragging the thread OT).
3 david 09.02.15 at 7:54 am: No – with Bayesianism one would still have to rigorously nail down a consensus model and some plausible range of prior probabilities.

That’s difficult, of course (much like monitoring appropriate p-values in frequentism is difficult). So you’d just have the totemic use of Bayes factors instead of p-values.
4 PlutoniumKun 09.02.15 at 8:19 am: E.O. Wilson got slammed a few years ago for arguing that there is too much focus on mathematical skills in the sciences (especially life and social science), but I think his point was misunderstood. There seem to be a lot of researchers out there who know enough statistics and maths to carry out a competent study, but without the deep specialist knowledge to identify systematic flaws in the standard techniques.

I think there is a strong argument to be made across the sciences for specialist statisticians to be used both as co-authors and peer reviewers in a whole range of subjects. The old saying ‘a little knowledge is a dangerous thing’ might seem a facile cliche, but sometimes there is truth to it – it might be better for a lot of researchers to simply acknowledge that the use of statistics is a highly specialised subject, and they should defer to the experts, rather than using it as a standard tool in research.
5 faustusnotes 09.02.15 at 8:42 am: I would dispute this sentence, John:

Under best practice 20th procedure, the experimenters would report the effect if it passes the standard test for statistical significance, and dump the experiment otherwise

This isn’t best practice 20th century procedure, it is common 20th century procedure and I would argue it remains common 21st century procedure too. Putting weight on this aspect of the problem sets up a false dichotomy between frequentist and Bayesian approaches, when either approach done poorly can lead to replication problems, or falsely retaining the null.

I’m also not convinced by your position that p-values don’t provide “sensible” answers. If your question is “Does this drug kill people” or “Does this drug lengthen survival times” the classical null/alternative approach works just fine, and I think there are very good reasons why alternative methods (such as Bayesian stats) aren’t used in pharmaceutical trials as much as might be expected under this philosophical framework. Whether the framework is suitable for psychological studies in general is a good question, but before we answer that we should be asking ourselves whether we are measuring things like depression, addiction, intelligence and other psychological concepts adequately. I think this replication study might be trying to get at some of those deeper problems in psychological research, rather than worrying about the statistical methods specifically…
6 Thomas Lumley 09.02.15 at 8:45 am: The standard deviation you’re looking for in footnote 3 is 0.35.

However, if you take into account that the result could be statistically significant in the wrong direction, you get a bimodal distribution with sd 2.3
7 Thomas Lumley 09.02.15 at 8:53 am: Sorry, it’s actually about 0.5, and only goes up to 0.54ish taking both tails into account.
8 Manta 09.02.15 at 9:11 am: Quiggin, your explanation would apply to all sciences, correct?

If you are right, a similar replication project in, say, particle physics would also fail to replicate 2/3 of the experiments published in the top journals.
9 Metatone 09.02.15 at 9:21 am: I’m not sure how the logistics of PlutoniumKun’s (@4) approach would work, but there’s definitely something in it, particularly for younger researchers. A couple of hurriedly passed stats classes aren’t really a good grounding for complicated experimental setups.

Of course, key here, both regarding improved setups and the issues around publishing is less distortionary resourcing. Perhaps at the deep level this is about acknowledging that in many fields “low hanging fruit have been picked for now*” and that the current incentives made sense in a world of low hanging fruit, but now need to change.

*If we get a paradigm shift, funding methods could revert to the gold-rush mentality…
10 Akshay 09.02.15 at 9:26 am: Manta@8: no, particle physics requires p-values of 0,0000003 to claim discovery, not p=0,05
11 Akshay 09.02.15 at 9:29 am: argh, fixed link
12 Manta 09.02.15 at 9:35 am: Ah, thank you Akshay
(however, your links send me to a 404 page. I think the right link is http://blogs.scientificamerican.com/observations/five-sigmawhats-that/)

Let me amend my question then: those scientific disciplines using p=0,05 convention should have a similar replication crisis? (which are those, btw?)
13 faustusnotes 09.02.15 at 9:52 am: My guess is that the replication crisis in psychology is much more about the experimental design than the stats. If you do a good quality, well-administered sample of a very large number of people with no drop outs, and analyse it well, the risk of type 2 error is so small for any effect size you care about that it will barely register.

The problem is that psychological studies often use highly biased samples (e.g. students) with very dubious measures of outcomes and exposures (e.g. poorly validated scales measuring depression), have high rates of missing values and likely reporting bias, and usually not a large sample size either because the phenomenon being investigated is rare (say, suicide attempts) and/or hidden (e.g. sexual abuse). So any two experiments that are ostensibly the “same” actually are likely to be completely different.

If you just consider the validity of a scale: if it has internal validity of say 0.85 on some criterion like Cronbach’s alpha, then even if its external validity is high if you administer it to two different samples and the phenomenon under investigation is borderline significant, two reasonably sized samples will fall either side of the significance barrier just because of the imprecision of the instrument – even if that instrument is valid across all study groups and populations.

This is much less likely to be a problem with e.g. a pharmaceutical study of a new HIV treatment because HIV load is a clear outcome with a defined measurement process, the drug titre is a direct measure of the drug, etc. It’s a unique problem of psychological research and it is not primarily a problem of the statistical analysis framework.
14 Hidari 09.02.15 at 10:48 am: ‘No – with Bayesianism one would still have to rigorously nail down ….a plausible range of prior probabilities.’

Well not really. I’m a subjective not an objective Bayesian, so your priors are just your priors. It’s just a guess based on what we all sorta think might be the case. So in the example linked to (@1) we already know, a priori, that telepathy is extremely unlikely, as our prior is set low. The precise numeric value doesn’t really matter so much as long as it reflects that, and we gather enough data.

‘If your question is “Does this drug kill people” or “Does this drug lengthen survival times” the classical null/alternative approach works just fine.’

No, the ‘classical’ approach absolutely does not work fine, especially not in ambiguous scenarios like the second. The reason that the frequentist approach is so popular is not because it ‘works’ but because, philosophically, it fits in with the Anglo-Saxon ‘model’ of how science ‘ought to’ work and Bayesianism (especially ‘subjective’ Bayesianism) doesn’t. Same with the Copenhagen Interpretation/Quantum Bayesianism: physicists don’t like it because they don’t like what it implies, not because of what it says, per se.
15 Scott Martens 09.02.15 at 10:50 am: Bayesian updating doesn’t really solve the problem for you if your priors are wrong, as pointed out in @3. And among the major problems with devising priors is how they wouldn’t be priors if they had a reasonable theoretical basis from a Bayesian standpoint. And even if we set that aside and we take Bayesianism at its most abstract – given enough datapoints, I will probably eventually converge on a correct probability assessment – you can still only figure out if your effect is real through aggressive testing for replicability.

I’m not sure going that way is any better than saying “p>=0.05 is too high a threshold given modern experimental conditions” and demanding p be something better suited to your actual circumstances.

In machine learning, we accept p-scores that are a lot higher, but make fewer strong claims about the existence of effects. If I find a statistical correlation between the English “blogger” and French “bloguer” in aligned translation data, I’m willing to accept that translation even with p-scores way over 0.05. Machine translation, however, can function adequately that way, and lives are not really at stake if my MT system gets stuff wrong sometimes. But if you do data mining, and you have a massive database that you test using a large number of techniques correlating a large number of variables, you need to either seriously hedge your conclusions, or demand p-scores a hell of a lot smaller than 0.05. I justify loose measures of correlation in MT via a practical argument not really based in statistical reasoning: I need to produce some translation, and I want to produce the best one I can, even if it’s crap. But a lot of modern science is big data mining one way or another, and as Ioannidis pointed out, there are a lot of ways to get spurious effects to register as significant.
16 Robert the Red 09.02.15 at 11:00 am: With regards to p-values, also see

Calibration of p Values for Testing Precise Null Hypotheses.
T Sellke, MJ Bayarri, and JO Berger.
The American Statistician v.55:62-71, 2001.

where a p=0.04 value (say) is shown to correspond to a 26% chance the null hypothesis is correct (in several different senses).
17 Robert the Red 09.02.15 at 11:01 am: Ugh, link didn’t show up. Didn’t format it correctly, I suppose.
http://www.stat.duke.edu/courses/Spring10/sta122/Labs/Lab6.pdf
18 Hidari 09.02.15 at 11:03 am: ‘Bayesian updating doesn’t really solve the problem for you if your priors are wrong, as pointed out in @3.’
???

The concept of a ‘wrong’ prior is incoherent. As I pointed out, from a subjective Bayesian viewpoint, your prior is just your prior. Mine is different from yours, yours is different from your Aunt Sally’s and so it goes. Probability doesn’t ‘exist’ in any objective sense (i.e before we start to ‘test’ it) so, to repeat, the concept of a ‘wrong’ subjective prior makes no sense. Likewise converging on a ‘correct’ probability assessment. Likewise, as an instrumentalist, discussing whether or not the effect is ‘real’ makes no sense (i.e. to me….although this is a slightly different issue. Bayes’ theory clearly relates more clearly to instrumentalism than does frequentism though, equally, it is not quite the same idea).

Incidentally here’s a blast from the past for all you frequentists out there:

http://ist-socrates.berkeley.edu/~maccoun/PP279_Cohen1.pdf
19 faustusnotes 09.02.15 at 11:21 am: hidari, can you explain why you think a p-value approach doesn’t work well in the case of testing whether a drug kills people or not, or the “ambiguous” case of comparing survival times? Preferably without assuming anyone here is either a “frequentist” or a “bayesian”.

Also hidari, I would have thought that e.g. a gaussian prior for a binomial distribution, i.e. a prior that allows a negative probability, would be “wrong” in some objective sense …?
20 Scott Martens 09.02.15 at 11:56 am: Hidari, trivially, if your prior assigns no probability to the correct hypothesis, no amount of updating can converge on anything like a correct answer. And a prior like “assume all is equally likely” will lead to you asserting that Obama might be a Kenyan-born Muslim. Taking pure Bayesian subjectivism at its word requires you to provide extra logic to justify using probability to inform decisions at all. Better to lose the pure subjectivism and acknowledge that probability calculations take place in a framework of prior knowledge and a context of construction that can just plain be wrong and requires some other justification.
21 casmilus 09.02.15 at 12:14 pm: See Colin Howson’s defence of Bayesianism in “Hume’s Problem”, he deals with some of the issues cited in comments.
22 Dan Riley 09.02.15 at 12:37 pm: @Manta, @Akshay,

There are more fundamental reasons why particle physics shouldn’t have a similar replication problem, or at least not to the same degree.

Reason 1: we publish negative results. For an example, take a look at the public results page for the CMS LHC experiment. Every title there that starts with the words “Search for” is a negative result: we looked for something, we didn’t find it, we set an upper limit on the rate.

Reason 2: we don’t p-hack. Standard best practice in particle physics is to work out the entire analysis procedure before looking at the real data, and to “blind” the data analysis so that the final result isn’t known until after the analysis procedure is fixed. We’re also generally relatively rigorous about our statistical analyses, particularly wrt multiple comparisons.

Reason 3: we routinely replicate. Generally, any major result needs to be seen by at least two independent experiments before being accepted (e.g., for the Higgs Boson observation, both the ATLAS and CMS experiments had to see it).

We do have the luxury of being able to make very strong assumptions about statistical independence, and often have very large statistics to work with. Fields that work with biological (or, far worse, human) subjects generally have it a lot harder wrt sample size and statistical independence. Nevertheless, registration of experiments, a less dogmatic approach to p-values, and a generally more sophisticated approach to statistics would help those fields a lot, IMO.
23 Hidari 09.02.15 at 12:44 pm: ‘Hidari, trivially, if your prior assigns no probability to the correct hypothesis, no amount of updating can converge on anything like a correct answer.’

OK I am going to stop here, but just to remind you, I am an instrumentalist. The word ‘correct’ has no meaning in my scientific Weltanschauung. Also, I do not believe that ‘probability’ actually exists (in the same way that tables and chairs exist) so again, talking about true or false probabilities has no meaning. According to my way of looking at things.

I am going to stop here because in my experience, arguments with Realists simply go round in circles because they simply don’t accept that instrumentalism is a meaningful philosophy of science or that anyone can really be a subjective Bayesian. But it is and I am.
24 faustusnotes 09.02.15 at 12:55 pm: Hidari, I don’t know much about this philosophy you’re presenting, but it seems unlikely to be able to get a drug through a regulatory agency. Is it put to use in such circumstances?

Dan Riley, replication is an impossibility in many social science and medical settings. Consider for example the work of Lee Robins on addiction. Replicating it would be almost impossible given the scale of the project and the need for a war to replicate it. When you can’t control experimental assignments, drop out or even the measurement process, in an urgent and important problem, you just need to work with what you’ve got. Fortunately though the questions you need to answer in such cases are usually simple, and elegant statistics are less important than elegant design to manage the biggest sources of bias and confounding.

Which is why I think the problem John is describing is not one of statistical method.
25 The Dark Avenger 09.02.15 at 1:06 pm: A probability is a number. Saying that probability doesn’t exist like tables and chairs do is like saying the number 5 doesn’t exist.
26 SusanC 09.02.15 at 1:32 pm: [My day job involves designing psychology experiments, but take this posting with a pinch of salt anyway…]

I thought replication studies are supposed to use a sample size that is adequate to control the type II errors, and so may often need a larger sample than the original experiment.

That is, in the original experiment you can take the risk of missing a small but real effect because your sample size is too small. [MAJOR CAVEAT: This assumes the only goal of the research is to get a publication out of it. If you actually want to know if a drug has a side-effect of killing people, you might want to be more careful. (“We gave the drug to Bob (N=1) and he seems OK” may not cut it).]

But if you’re going to publish a paper saying “we tried to replicate this result but we didn’t find it”, you need to guard against the possibility (type II error) that the effect is real, and the original authors were right. So a larger sample may be needed.
27 Scott Martens 09.02.15 at 1:57 pm: @Hidari: "I am going to stop here because in my experience, arguments with Realists simply go round in circles because they simply don’t accept that instrumentalism is a meaningful philosophy of science or that anyone can really be a subjective Bayesian. But it is and I am."

Probably a good idea. Once you start telling me that it’s your subjective belief that airplanes fly (because, hey, maybe they don’t) I will not take you seriously.

@The Dark Avenger: I don’t see any fives, point one out to me. Hidari is a subjectivist, and I have some trouble with that, but that doesn’t make me some Platonic mystic who believes in numbers. I’d take Xenu over the number 5 any day.
28 temp 09.02.15 at 2:03 pm: The probability of Type I error in psychology is zero. The brain is a complex system; everything has an effect on everything else.
29 Hidari 09.02.15 at 2:08 pm: ‘A probability is a number. Saying that probability doesn’t exist like tables and chairs do is like saying the number 5 doesn’t exist.’

The number 5 doesn’t exist in the same way that tables and chairs exist. Even Plato didn’t think that. (In fact, especially Plato).
30 Scott Martens 09.02.15 at 2:11 pm: @Hidari "Even Plato didn’t think that. (In fact, especially Plato)."

I see that we do agree on something.
31 Bruce Wilder 09.02.15 at 2:14 pm: A helpful visualization

http://rpsychologist.com/d3/NHST/
32 Scott Martens 09.02.15 at 2:21 pm: @Bruce Wilder: Your infographic misses the best quote of all:

Small wonder that students have trouble [with statistical hypothesis testing]. They may be trying to think.

— W. Edwards Deming
33 Bruce Wilder 09.02.15 at 2:22 pm: John Quiggin: The broader problem is that the classical approach to hypothesis testing doesn’t have any real theoretical foundations: that is, there is no question to which the proposal “accept H1 if it would be true by chance only 5 per cent of the time, retain H0 otherwise” represents a generally sensible answer. But, we are stuck with it as a social convention, and we need to make it work better.

faustusnotes: When you can’t control experimental assignments, drop out or even the measurement process, in an urgent and important problem, you just need to work with what you’ve got. Fortunately though the questions you need to answer in such cases are usually simple, and elegant statistics are less important than elegant design . . .

This topic is certain to generate some remarkable topics for zen meditation.
34 Hidari 09.02.15 at 2:27 pm: @24 https://en.wikipedia.org/wiki/Bayes%27_theorem#Drug_testing

This is the difficulty I have with most of these discussions. Most people wilfully decide not to know anything about the problems with NHST (and it is wilful…as the Cohen paper I linked to demonstrated, we have known the problems with frequentism since at least the 1930s) and then I get asked bizarre questions with phrases like ‘this philosophy you are representing’ as if I am Thomas Bayes or if I personally invented the concept of instrumentalism and both theories are used solely by me.

If anyone cares, the Bayesian approach to probability (subjective) is sketched out in Nate Silver’s The Signal and the Noise.

http://www.amazon.co.uk/The-Signal-Noise-Science-Prediction/dp/0141975652
35 Bruce Wilder 09.02.15 at 2:30 pm: Is it not possible to think that the concept of chair exists in the same way as the concept of the number 5, and that actual chairs have the same relationship to the concept of chair as objects of things in sets or counts of 5 have to the concept of 5?
36 RJB 09.02.15 at 2:31 pm: The issue is larger than just shortcomings of statistical analysis and replicability. Fortunately, accountants are on the case! Here is a Call for Papers for the 2017 Journal of Accounting Research Conference soliciting proposals for registered reports. Deadline for proposals is November 1; a wide range of methods and topics are encouraged, not just experiments (as with most Registered Reports) and not just accounting.

From the details (pdf) here is information about the Registration-based Editorial Process (REP) they are using:

REP operates as follows. Editors determine whether the initial proposal submission is promising enough to be sent to one or more referees for review, and if so, use the review(s) to determine whether the proposal should be rejected, returned to the authors for revision, or approved. Approval letters spell out the conditions under which they will accept the second stage report for publication. These conditions will always require that authors fulfill their commitments to gather and analyze data as proposed, and never require that the results support any particular conclusion (such as the stated hypotheses). However, editors may also include other conditions to address specifically identifiable concerns about the informativeness of the data or the thoroughness of the additional analyses. To the extent possible, conditions will be crafted to allow authors to guarantee publication simply by living up to commitments under their control. Authors can withdraw an approved proposal at any time, but cannot submit their resulting manuscript to another journal until they do so.

By allowing editors and authors to agree on conditions for publication early on, REP encourages authors to make larger investments in research studies, enhances the reliability of the resulting report, and accelerates the input authors receive so they have it when they need it most (before they gather data):

• REP Encourages Investment. Without the commitments embodied in REP, authors are effectively creating research “on spec”, speculating that the end result will be attractive to an editor. Authors are wise to limit their investment in such circumstances, since they cannot easily predict either editors’ tastes or the end result of the study. The acceptance decision may also be far in the future. REP encourages authors to propose studies that are more ambitious (e.g., in the scope of their data gathering, by deviating from conventional tastes) because they can defer the bulk of their investment until after an editor has committed to publish the end result.
• REP Enhances Reliability. In the traditional editorial process, authors offer their completed research results to editors for evaluation. This leaves authors with both the incentives and the ability to overstate their results by choosing statistical tests that indicate strong support for their predictions, and revising those predictions to make their theory seem more powerful. REP reduces the incentives and ability to pursue such strategies, by making publication of an accepted proposal contingent on whether the author lived up to their commitments to gather data and analyze it appropriately, rather than on the outcome of those results.
• REP Accelerates Input. An editor’s goal is to publish good papers. Editors accomplish this goal partly by making wise decisions about which papers to publish and which to reject, and partly by providing input that helps authors improve their initial submissions. Such input is often very painful in the traditional editorial process, because it comes after the author has made crucial and sometimes irrevocable decisions on how to gather data. REP provides authors with input before they gather data, improving the likelihood that the editors (and authors) can publish good papers.
37 Nick 09.02.15 at 2:32 pm: @Scott Martens "Probably a good idea. Once you start telling me that it’s your subjective belief that airplanes fly (because, hey, maybe they don’t) I will not take you seriously."

Airplanes fly by definition. If they didn’t, they wouldn’t be airplanes. As to whether the airplane you’re currently asleep in is still flying…how would you know exactly?
38 TM 09.02.15 at 2:37 pm: JQ’s analysis is fundamentally sound but he doesn’t spell out the real lesson we should take from it: to really scientifically confirm a phenomenon, one replication is still not enough (and zero replication is just bizarre – a one time observation can never justify a scientific theory). In the hard sciences, experiments are replicated many times. If replication were a routine procedure in psychology, we wouldn’t be guessing whether the errors are type I, type II or more fundamental experimental blunders. We would have a fairly good idea which results have stood the test of science and which haven’t.
39 faustusnotes 09.02.15 at 2:39 pm: Bruce, to elaborate: all the fancy Bayesian credible intervals and careful prior selection you can throw at a problem are worth nothing if the experiment is poorly designed and the data badly collected. But if your experiment is well designed and your data well collected – e.g. an RCT with a large sample size, careful patient monitoring and a carefully selected, representative sample – you will have little more to your task than comparison of means. Sure you could do this with a suitable prior and a Gibbs sampler, start from multiple chains etc. [a topic which is itself in some dispute still], develop a complex posterior distribution and solve it analytically, whatever … but if your sample size is large and your experiment well designed, you could just calculate the 95% confidence interval of the means and compare it – or do a t test. You’ll get the same answer either way [these large-sample limit similarities and the diminishing role of the prior in large samples are points made early on in BDA].

The problem with replicating psychology studies does not lie in the quality or elegance of their statistical methods in general, but in the nature of the experiments and their design. I’m waiting to see if John clarifies the sentence you’re referring to, but ultimately I think he’s laid out a furphy here.
40 faustusnotes 09.02.15 at 2:52 pm: TM how do you propose to replicate an experiment into e.g. the Australian gun buyback scheme? It’s of considerable public policy interest internationally, statistical analysis will require 20 or more years of follow up from the original intervention, and it has happened precisely once. To replicate it you need a counter-factual Australia or a real second intervention in a similar country and another 20 years. There’s a mass shooting every day in America – waiting multiples of 20 years for definitive proof that a given intervention will work isn’t a luxury anyone has.

Sure you could speed up the process of answering this question by a Bayesian analysis with a carefully selected prior (e.g. one that assumes an effect of the intervention). Do you think that an analysis with a pre-selected prior favouring a controversial intervention is going to improve its acceptability to wavering policy-makers in other countries?
41 TM 09.02.15 at 3:04 pm: Wrt to hypothesis testing, the difference between significance and effect size is often misunderstood. Depending mostly on the sample size, an effect can be real and large yet not appear significant, while an effect so small it is practically irrelevant can appear highly significant in the statistical analysis. Sometimes, authors, having produced a statistically significant effect, neglect to report effect size or leave it somewhere in the small print. Secondary reports very often omit effect size. This is totally misguided. The p-value is a statistical artifact, a useful tool but no more. The effect size is what really matters. Scientific convention has pushed researchers to look for (and sometimes manufacture) statistical significance rather than real world relevance.
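A minimal sketch of the distinction drawn in 41, assuming a one-sample z-test with a known standard deviation of 1. The effect sizes and sample sizes are illustrative assumptions, not figures from any study discussed here.

```python
from math import erfc, sqrt

def two_sided_p(effect, n, sd=1.0):
    """Two-sided p-value for a z-test of 'true mean = 0' when the
    observed sample mean equals `effect` (known sd, sample size n)."""
    z = abs(effect) / (sd / sqrt(n))
    # 2 * (1 - Phi(z)) is the same quantity as erfc(z / sqrt(2))
    return erfc(z / sqrt(2))

# Large observed effect, tiny sample: z = 1.6, not significant at 5 per cent
p_large_effect_small_n = two_sided_p(effect=0.8, n=4)

# Practically negligible effect, huge sample: z = 10, "highly significant"
p_tiny_effect_huge_n = two_sided_p(effect=0.01, n=1_000_000)

print(p_large_effect_small_n, p_tiny_effect_huge_n)
```

The p-value here depends on the effect only through the ratio of the effect to its standard error, which is why reporting it without the effect size loses the information that matters.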
42 TM 09.02.15 at 3:18 pm: 40: my understanding is that the psychology replication study concerns lab experiments. There is no reason why there couldn’t have been much more aggressive attempts at replication. Field studies do of course pose a whole different set of problems.

23: I’m not aware of anybody who professes to believe that “‘probability’ actually exists (in the same way that tables and chairs exists)”.
43 TM 09.02.15 at 3:36 pm: Also, gravity may not “exist in the same way that tables and chairs exist”. That doesn’t invalidate the scientific concept.
44 JK 09.02.15 at 4:09 pm: At a rough guess, half of the observed failures were probably Type I errors in the original study, and half were Type II errors in the replication.

Huh? The original studies were almost certainly affected by lots of publication bias and “p-hacking”, while the replications were more powerful statistically (larger sample sizes) and not subject to publication bias (all were preregistered and published). I don’t see how half of the replication failures could be due to Type II errors.
45 Manta 09.02.15 at 4:19 pm: @TM From my (poor) understanding of quantum mechanics, it’s probability amplitudes that actually exist: tables and chairs (and probabilities) are artifacts of our brain…
46 The Dark Avenger 09.02.15 at 4:51 pm: 5 isn’t an objective fact, it’s a measure of reality. If I drop a coin from my hand, the P=1. The P is real, just as a yardstick with numbers is real.
47 The Dark Avenger 09.02.15 at 4:53 pm: That is, the P of it hitting the ground is equal to 1.
48 TM 09.02.15 at 5:23 pm: 44: good point.
49 Snarki, child of Loki 09.02.15 at 5:29 pm: @DarkAvenger: “the P of it hitting the ground is equal to 1.”

No it’s not. P~1-(1E-24), where the difference from unity comes about when the coin freezes solid and shoots into the air. Thermal motion.

Feel free to give me 1e24 coins, and I’ll test it for you.
50 Dogen 09.02.15 at 6:26 pm: Hidari@34: I read Nate Silver’s book and was disappointed in his description of his “Bayesian” methods. All he seems to be doing is using standard conditional probabilities, (and so does the Wikipedia excerpt on drug screening.) I didn’t find anything remotely controversial to a “frequentist” in either reference, just standard probability theory. What am I missing here?

My (very very poor) understanding is that current state-of-the-art Bayesianism is all about model building rather than Null Hypothesis Testing. Am I wrong on this?

And with respect to the original topic, I think the problem with the psychological studies (and likely a lot of medical studies) is a confusion between investigatory statistics and what one might call confirmatory statistics. Call it the “garden of forking paths” or “data mining” or “p-hacking” or whatever – this is why physicists value replications and why other sciences should figure out how to value them as well.
51 The Dark Avenger 09.02.15 at 7:48 pm: @Snarki, I apologize, a true statement would be that the P of it hitting the ceiling is mathematically indistinguishable from zero.
52 John Quiggin 09.02.15 at 11:48 pm: Faustusnotes @5 I used “best practice” advisedly, rather than “best possible practice”, to refer to what was actually done by the best and most careful researchers and published by the best journals. As I noted, common practice was (and still is) to run lots of models on the same data and report the runs that “worked”. At least in the social sciences, I’m not aware of any field where the leading journals followed, in C20, what we might regard as “best possible practice”, such as a requirement for pre-specification combined with a commitment to publish null results, if the tests themselves were well-designed and run.
53 Adam Hammond 09.03.15 at 12:18 am: @Manta
Many publications in biology do not pose hypotheses in terms of probability. If you predict: “organisms expressing mutation A will lack embryonic structure X,” then you don’t ever get into p values (as long as everyone agrees with your binary determination of the existence of X). You could phrase it in terms of a p value – and maybe we should.

My point is that it will be very difficult to tot up which fields use a p<.05 standard.
54 faustusnotes 09.03.15 at 1:16 am: TM, for the reasons I specified above (sample selection, instrument validity, etc) it’s unlikely that even laboratory studies can be replicated in the same way as for physics.

e.g. “German group replicates Australian group’s findings on sub-atomic particle Y” is very different in implication to “German group replicates Australian group’s findings on the relationship between high school bullying and [insert pointless lab result].”

The kinds of findings in psychology that can be obtained from a lab experiment that is universally repeatable are, I would guess, typically not very interesting.
55 Tabasco 09.03.15 at 1:39 am: commitment to publish null results

In the old days, when results were published only in expensive dead trees, journal editors were right to pragmatically not waste space on null results. Those days are gone. Put the ‘final’ results in the dead tree versions and the null results for all to see in cyber space, stored in one of Amazon’s server farms.
56 Bruce Wilder 09.03.15 at 5:33 am: journal editors were right to pragmatically not waste space on null results.

They have become “null”? Interesting slip.

Something has gone wrong in our psychology that overwhelms our deliberate understanding of what we are doing.
57 Nick 09.03.15 at 5:40 am: !!
58 John Quiggin 09.03.15 at 6:37 am: @56 Not a slip. The kind of result I’m talking about is what classical theory refers to as “failure to reject the null hypothesis”, or “null result” for short.
59 Z 09.03.15 at 8:28 am: I believe the main problem is the (largely social) constraint imposed on researchers to publish at the current expected rate. Given this, dubious practices to speed up the process of obtaining a publishable article (p-hacking, lack of replication, weird experimental designs that just happen to produce significant results and the not sufficiently careful refereeing that allows them to appear) are bound to be ubiquitous.

As evidence for this thesis, consider math. By nature immune to the statistical concerns raised by the Replication project, “most published research results are wrong” (in the technical sense of not having a suitable proof) could be true for it as well (replace p-hacking by the practice of relegating the technical details of the proof to a subsequent publication and poor experimental designs by poor standards of rigor).
60 Manta 09.03.15 at 8:52 am: Z, in maths you are supposed NOT to trust the author’s word and check the proof yourself. Of course, in practice many times you cannot do that, and rely on other people’s checking….

There are (important) authors notorious for publishing wrong “proofs” (or “theorems”, for that matter): these you know you have to handle with extra care.

The practice is often not to relegate the technical details to a subsequent publication, but not to publish them at all (as rule-of-thumb, the words “it’s obvious that” or “it’s trivial that” or “by X”, where X is a word for some set of standard techniques, most of the time mean “I’ve never bothered to actually check the following statement”).

And finally, the standard of proof and what mathematical rigor is has changed a lot with time. (For instance, infinitesimals were all right in Newton’s formulation of calculus, were banished as non rigorous with Cauchy, Dedekind, etc, and have been brought back as legitimate tools with Robinson…).
61 Ebenezer Scrooge 09.03.15 at 10:51 am: I used to be a physical scientist, back in my youth. When my data looked like they needed any statistical analysis more powerful than a least squares fit, I concluded that the experiment needed to be redesigned.
Social scientists don’t have this luxury; social science is much harder.
62 SusanC 09.03.15 at 12:02 pm: In real experiments, I am usually more worried about the reproducibility of the sampling, rather than type I/type II errors. Increasing N is usually easier than dealing with sampling bias.

For example, in a typical experiment investigating something about autism, you might recruit 30 children with a diagnosis of autism from a special school, and 30 children without an autism diagnosis from a mainstream school. Carefully matched for age, gender, handedness,… of course: you wouldn’t want your experiment to be confounded by your criterion and control groups being at different developmental ages, or the control group being 50% female while the criterion group has more boys than girls.

But suppose someone wants to replicate your result. They’re going to recruit their sample of participants from different schools, at a different time and place. Lots of things could be different about the replication sample that swamp the original effect. e.g. was being able to get into the special school correlated with family socioeconomic status, in a way that differs between the original and replication sample? Even the diagnosis of “autism” isn’t as stable an entity as one might like for replication purposes. It’s a spectrum condition: do the original and replication samples have it to the same level of severity? Was the level of severity of symptoms needed to get a place in the special schools different between the original and replication samples (due to changes in say, local education authority policy, or health insurance policy, or….) etc.
63 Snarki, child of Loki 09.03.15 at 12:39 pm: Main result: “humans make TERRIBLE data points”
64 TM 09.03.15 at 3:31 pm: fn 54, I am aware that people aren’t as replicable as physical particles. But what is the point of doing lab experiments in psychology if you assume from the start that you can’t isolate the phenomenon you are studying? Psychologists are doing lab studies because they want to be taken seriously as scientists. If these supposedly scientific studies aren’t actually replicable, they need to give up the pretense. The NYT published a truly pathetic apologia for psychology:

“But the failure to replicate is not a cause for alarm; in fact, it is a normal part of how science works.” (http://www.nytimes.com/2015/09/01/opinion/psychology-is-not-in-crisis.html)

The way science works is that if you can’t replicate, you have to assume that your result was a fluke. It is certainly possible that you can come up with a more sophisticated theory or hypothesis that explains both the initial result and the replication failure. But then you have to do a new study to test your new hypothesis. What you can’t do is say that replication doesn’t matter. Or you can, but then you can’t be expected to be taken seriously.
65 Bruce Wilder 09.03.15 at 3:35 pm: JQ @ 58 The slip is the slide along the path of semantic generalization from an arbitrary social convention to the conviction that “null results” contain no information, nothing of interest, which journal editors should waste paper space on.

NHST, as a convention, has this dispositive function in practice, though there is no epistemological basis for it. A well-designed study with a “null result”, in light of a scientific method of elimination, is presumptively as informative. Why are our human minds so easily tricked?
66 Nym w/o Qualities 09.03.15 at 3:45 pm: TM @64: “But then you have to do a new study to test your new hypothesis. ” Indeed, and so careers are built. Isn’t that really the motor of the whole machinery?
67 Manta 09.03.15 at 4:03 pm: It seems to me that psychology has the “clothes” of a science (experiments, p-values, etc) without the “meat” (predictive power and replicability).
Maybe it should be regarded part of humanities, like history or philosophy.
68 mbw 09.03.15 at 4:17 pm: Great OP. On null hypotheses:

Which of these is a good null hypothesis:
1) Our search for resonances in a particular energy range will just find the usual background events?
or
2) Our results will require throwing out the framework of modern particle physics.

Obviously, (1) is a good null. (2) is an extreme alternate. Right?
Except if the experiment was the recent Higgs search, (1) and (2) were in fact the same hypothesis, logically equivalent statements.

To be continued.
69 mbw 09.03.15 at 4:23 pm: Let’s take a genuine real world question. What effect do flu shots have for people over 65 years of age?

Which of these is a good null?
1) About the same as for anybody else, preventing maybe 2/3 of serious illness, depending on the year.
2) As with any treatment, the null is always that there’s no effect at all.

The standard hypothesis testing rules make you choose one of these nulls, or maybe some other even stupider one.
Whatever one thinks of those philosophical arguments about Bayes, at least the Bayesian approach doesn’t make you throw all your common sense out the window before you start. And yes, common sense is often crap, but the usual version of the frequentist approach requires you to replace common sense with some made-up crap masquerading as rigor.
70 mbw 09.03.15 at 4:32 pm: Faustus asks:

“I’m also not convinced by your position that p-values don’t provide “sensible” answers. If your question is “Does this drug kill people” or “Does this drug lengthen survival times” the classical null/alternative approach works just fine, and I think there are very good reasons why alternative methods (such as Bayesian stats) aren’t used in pharmaceutical trials as much as might be expected under this philosophical framework.”

The classical approach is terrible for questions like that. The answer to “does this drug kill people” is just plain yes. E.g. aspirin via hemorrhagic strokes, even pure sugar pills via diabetes, etc. The question is more like “How many people does it kill when vs. how many it saves when?” Giving special privileges to the hypotheses “it doesn’t kill anybody” or even (in most cases that get to big trials) “it doesn’t save anybody” makes no sense. Doing that nonsensical procedure costs lives.
71 Manta 09.03.15 at 4:38 pm: @mbw
Isn’t the null hypothesis in those cases “it cures as many people as placebo”? or “it’s no better than established medicines”?
72 mbw 09.03.15 at 4:44 pm: @manta: So for the 65+ flu vaccine, despite the known efficacy of the vaccine in under 65 groups, the null hypothesis is equivalence to placebo? Or “no better than established medicines” (of which, at the time of the most important studies, there were none?)

Seriously?

By the way, (spoiler alert) the correct answer is:
about 50% effective. Right in between the two implausible nulls.
73 TM 09.03.15 at 4:50 pm: mbw, supposedly, flu shots have already been tested and found effective in the general population. Convention dictates that you are testing version 1, not 2.

Perhaps more to the point: your research question is really to estimate the effectiveness of flu shots in that population. Hypothesis testing isn’t really the answer to that. You are looking for a confidence interval. As I said in 41, there is a lot of confusion between significance and effect size. Significance is just a tool for weeding out random flukes. It’s rarely an end in itself. In most real world situations, you are really interested in effect size. But science education tends to overemphasize the importance of hypothesis testing, to the point that the statistical significance of an effect is seen as more important than its real world relevance. Graduate students in many fields are told that if they can’t express their research questions as testable hypotheses, they are not doing science, and funding agencies won’t fund them. As a result, a lot of nonsensical null hypotheses are produced and certain avenues of research are inhibited.

There’s no shortage of criticism of these tendencies, e.g. Ziliak and McCloskey “The Cult of Statistical Significance” (I’m not endorsing the book – I think HT is useful but agree that its application is often misguided).

Some interesting refs:
https://www.sciencenews.org/article/odds-are-its-wrong
http://www.nature.com/news/scientific-method-statistical-errors-1.14700
http://www.deirdremccloskey.com/articles/stats/preface_ziliak.php
74 TM 09.03.15 at 4:56 pm: Further mbw, your example is a good one to show why students are so easily confused by HT. The fact is that the choice of HT always depends on the context and prior knowledge.
There are no clear and specific rules and many text book examples appear contrived.
75 TM 09.03.15 at 5:01 pm: One basic rule I think all students should be taught: *Always estimate and report a confidence interval for whatever you are trying to measure*. This should be obvious since every HT can equivalently be decided from a CI and the CI always has more information than the HT (which is just binary). I think students shouldn’t be given pure HT problems at all, they should learn to always do a CI, and they should understand why any HT can be decided by a CI.
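A rough sketch of how an HT can be read off a CI, assuming a large-sample normal approximation (a z interval rather than a t interval) for a single mean. The function names and the sample values are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def mean_ci(sample, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(sample)
    std_err = stdev(sample) / sqrt(len(sample))
    return centre - z * std_err, centre + z * std_err

def reject_null(sample, null_value, level=0.95):
    """Two-sided test decided from the CI: reject exactly when the
    hypothesised value falls outside the interval."""
    low, high = mean_ci(sample, level)
    return not (low <= null_value <= high)

# Hypothetical measurements (illustrative only)
sample = [0.2, 0.5, 0.1, 0.4, 0.3, 0.6, 0.2, 0.5, 0.3, 0.4]

low, high = mean_ci(sample)
print(low, high)
print(reject_null(sample, 0.0))  # null of "no effect"
print(reject_null(sample, 0.4))  # null of "effect equals 0.4"
```

The interval answers every such test at once, for any null value, while each individual test returns only a yes or no.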
76 mbw 09.03.15 at 5:04 pm: @TM- I think we’re in agreement. HT is sometimes a useful first screen to decide whether to pay attention to something. As the endpoint of an analysis, it’s nonsense. As the organizing principle for stats education, it’s a disaster.
77 mbw 09.03.15 at 5:09 pm: @TM, BTW there was a period in which the convention in the published literature re flu vax for 65+ was taken to be (2), not (1).
78 TM 09.03.15 at 5:14 pm: 76: Yep.
79 Bruce Wilder 09.03.15 at 5:15 pm: And as a dispositive for publication . . .
80 Hidari 09.03.15 at 6:05 pm: @73 Well this (https://www.sciencenews.org/article/odds-are-its-wrong) is an interesting article, in that it clearly states what some of the other posters in this thread are under the impression are the ramblings of a lunatic (at least when I say it).

An extract: ‘Such sad statistical situations suggest that the marriage of science and math may be desperately in need of counseling. Perhaps it could be provided by the Rev. Thomas Bayes.

Most critics of standard statistics advocate the Bayesian approach to statistical reasoning, a methodology that derives from a theorem credited to Bayes, an 18th century English clergyman. His approach uses similar math, but requires the added twist of a “prior probability” – in essence, an informed guess about the expected probability of something in advance of the study. Often this prior probability is more than a mere guess – it could be based, for instance, on previous studies.

Bayesian math seems baffling at first, even to many scientists, but it basically just reflects the need to include previous knowledge when drawing conclusions from new observations. To infer the odds that a barking dog is hungry, for instance, it is not enough to know how often the dog barks when well-fed. You also need to know how often it eats – in order to calculate the prior probability of being hungry. Bayesian math combines a prior probability with observed data to produce an estimate of the likelihood of the hunger hypothesis. “A scientific hypothesis cannot be properly assessed solely by reference to the observational data,” but only by viewing the data in light of prior belief in the hypothesis, wrote George Diamond and Sanjay Kaul of UCLA’s School of Medicine in 2004 in the Journal of the American College of Cardiology. “Bayes’ theorem is … a logically consistent, mathematically valid, and intuitive way to draw inferences about the hypothesis.” (See Box 4)

With the increasing availability of computer power to perform its complex calculations, the Bayesian approach has become more widely applied in medicine and other fields in recent years. In many real-life contexts, Bayesian methods do produce the best answers to important questions. In medical diagnoses, for instance, the likelihood that a test for a disease is correct depends on the prevalence of the disease in the population, a factor that Bayesian math would take into account.

But Bayesian methods introduce a confusion into the actual meaning of the mathematical concept of “probability” in the real world. Standard or “frequentist” statistics treat probabilities as objective realities; Bayesians treat probabilities as “degrees of belief” based in part on a personal assessment or subjective decision about what to include in the calculation. That’s a tough placebo to swallow for scientists wedded to the “objective” ideal of standard statistics. “Subjective prior beliefs are anathema to the frequentist, who relies instead on a series of ad hoc algorithms that maintain the facade of scientific objectivity,” Diamond and Kaul wrote.’
81 TM 09.03.15 at 6:17 pm: Well I don’t think Bayesians are lunatics. But I disagree with the author (whom I referenced) when he says: “Standard or “frequentist” statistics treat probabilities as objective realities”. An observed frequency is indeed an objective reality, which explains its appeal with empirical scientists, but “probability” is a mathematical concept that has a different philosophical status. It is easy to conflate these since we use frequencies to estimate probabilities but it’s not the same. If I can use probability to accurately predict an empirically observed frequency, that to me indicates that probability is a scientifically useful concept. Of course most practitioners don’t care about the philosophical nuances but I must object to the suggestion that frequentist approaches per se are philosophically naive.
82 mbw 09.03.15 at 6:29 pm: @Hidari- As you can see, my approaches to most problems are Bayesian. It’s not wise to get too dogmatic about that approach, however. E.g. what do you make of the rigorous argument by Wasserman and Robins that for an important class of real-world high-dimensional problems, no Bayesian estimator converges with useful rapidity, while there is a simple frequentist estimator that works quite well.
83 Dogen 09.03.15 at 9:04 pm: I also don’t think Bayesians are lunatics. I greatly admire Andrew Gelman for instance.

But I do think these two sentences (from the article quote from Hidari above) are lunatic:

“In many real-life contexts, Bayesian methods do produce the best answers to important questions. In medical diagnoses, for instance, the likelihood that a test for a disease is correct depends on the prevalence of the disease in the population, a factor that Bayesian math would take into account.”

Let’s be crystal clear, the phrase “the likelihood that a test for a disease is correct depends on the prevalence of the disease in the population” is totally non-controversial, and also totally standard. Conditional probability must be taught in every introduction to probability class and this is a straightforward application of it. I can’t imagine why anyone would call this “Bayesian”. Y’all (all you strict Bayesians) need a better go-to example.

The follow-on phrase “a factor that Bayesian math would take into account” implies that a non-Bayesian would not take this into account. That’s lunacy. (I don’t doubt there are people who make this type of error. But it’s just an error, not something that defines Bayesianism fer cryin’ out loud.)

Now, if you want to talk about the true philosophical meaning of “probability”, measure theory, or whatever, that’s a different and potentially very interesting discussion.

I personally believe probability is an objective reality but is often badly described – and the bad descriptions can mislead badly. For example, consider the coin that’s been tossed and is hidden under the tosser’s hand. People usually say something like “the probability that it’s ‘heads’ is .5” That’s nonsense of course, the probability is either 1 or 0 since the event has already happened...
84 mbw 09.03.15 at 10:09 pm: @Dogen. OK, let’s do the standard example:
old HIV test, 95% accurate either way, population (say all male residents of Toledo between 20 and 30 yr old) with 1% prevalence. etc. The usual answer that you know, about 1/6 chance that a positive screening result on someone from that group means actual HIV.

Now your screening accidentally picks up a guy 31 yr old. What would a positive result on him mean? (formally, not enough info to say anything?)

Or you notice needle marks on the arm of one guy. What do you think his positive result means? (too much info?)

Or you start testing in Akron but don’t have prior population results there.
etc.

This is the real world of decision-making.
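The screening arithmetic in 84 can be sketched as follows, taking “95% accurate either way” to mean 95% sensitivity and 95% specificity (an assumption about the intended reading). The second prevalence figure is purely hypothetical, included only to show how strongly the answer depends on the prior.

```python
def positive_predictive_value(prevalence, sensitivity=0.95, specificity=0.95):
    """P(infected | positive test), by Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

# The standard exercise: 1% prevalence gives roughly a 1-in-6 chance
ppv_low_prevalence = positive_predictive_value(0.01)

# Hypothetical higher-risk subgroup with an assumed 20% prevalence
ppv_high_prevalence = positive_predictive_value(0.20)

print(ppv_low_prevalence, ppv_high_prevalence)
```

The formula itself is uncontroversial. The questions about the 31-year-old, the needle marks and Akron are all questions about what number to feed in as `prevalence`.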
85 Bruce Wilder 09.03.15 at 10:41 pm: OK, frequentists and Bayesians, what say thee about this:

http://www.nytimes.com/2015/09/03/health/insurer-says-clients-on-daily-pill-have-stayed-hiv-free.html

Kaiser, with a huge database, mines said database and offers a judgment backed by the information.

Previously, a formal test was called off after several people were infected by a serious disease. The “power” of the formal test is praised.

I tend to think people, who dress up with statistics too often have weak theories and strong opinions, and that combination does not always work out well. What that has to do with probability and the philosophy of inference, I leave to others.
86 mbw 09.03.15 at 11:12 pm: Nice data. It strongly indicates that the drug makes a big reduction in HIV transmission- maybe by more than a factor of 10.
It’s easy to deal with sloppy inputs like this from a Bayesian viewpoint. I don’t know what a rigid frequentist would do with it, but perhaps one will write in.
87 Dogen 09.03.15 at 11:28 pm: @mbw

Much better examples. I’m assuming you agree the ones I cited are not?

And I didn’t find any good examples in Nate Silver’s otherwise excellent book.

@Bruce Wilder

Looks like everyone is in agreement in the results since they stopped the clinical trials.

I don’t qualify as a statistician of either stripe but my impression is that there is much less controversy between the camps than there used to be.
88 John Quiggin 09.04.15 at 12:28 am: Interestingly, in both game theory and decision theory, Bayesian reasoning is the orthodoxy. Challenges to the orthodoxy come from
(a) Claims that no meaningful probability can be assigned to some events (the big name here is Daniel Ellsberg, who was also a prominent figure in US politics for a time in the early 1970s)
(b) Unawareness of some possible events sometimes called “black swans”. This implies a prior probability of zero for these events, which creates big difficulties for Bayesianism, as Scott Martens mentions above.

But hardly anyone suggests that these criticisms represent an argument for classical statistics.
89 christian_h 09.04.15 at 1:28 am: Interesting post and discussion, but can I just repeat something Dogen and others have pointed out: there is nothing particularly “Bayesian” in employing conditional probabilities. In the article JQ linked arguing that a large proportion of research findings are “false” (scare quotes used in a likely doomed attempt to avoid arguments over the nature of “truth”) I see nothing that a frequentist statistician would disagree with, for example.
90 mbw 09.04.15 at 1:37 am: @ Dogen #87. Right. Straight conditional probabilities (as in the standard intro screening exercise) don’t require Bayes. Real-world use usually does.
91 Bruce Wilder 09.04.15 at 1:47 am: christian_h @ 89

The problem of the OP is not the people, who have an opinion on the meta-issues; the problem is the people, who do not, and just want to go thru the motions of some procedure, and be justified.

In a strange sort of way it reminds me of ethics and Christianity.
92 faustusnotes 09.04.15 at 2:16 am: In my opinion there are a few misconceptions here. To address one or two …

mbw at 70, aspirin kills lots of people so classical stats failed: we know that aspirin kills through hemorrhage precisely because of classical tests that showed higher rates of incidence. Nothing secret there: p-values proved your point.

mbw at 69, flu shots and hypothesis tests vs. CIs: there’s no reason not to do a hypothesis test and report a CI as well (in fact most journals require it). In the case of flu, there’s good reason to think the vaccination may be less effective than in other age groups but because the risk of flu is higher in the elderly we need to adjust our vaccination policy to account for effectiveness. Test against the 2/3 null, reject if necessary, then report the actual effectiveness with CI that can then be used in assessing policy.

Bruce Wilder re pre-exposure prophylaxis: I don’t see the big deal here. Prep has been well understood for years, the question is whether it is effective after adjusting for risk compensation and whether it is cost-effective on top of test-and-treat strategies. Risk compensation can’t be replicated easily in labs and cost-effectiveness requires good data. If Kaiser have added to this information that’s great but the primary issue here isn’t Bayesian vs. frequentist false dichotomies. As I keep saying it’s about experimental design.

Lots of people re: 95% CIs: CIs and t-tests are formally equivalent, and using one instead of the other is not philosophically different. In the case of a Bayesian credible interval, yes you have incorporated a prior but you are still doing philosophically the same thing as you do when you do a standard CI comparison. In fact if you look at how Bayesian stats are generally used to answer practical problems, what they do is report a credible interval and say “since these values are outside the interval, we conclude that …” They’re doing the same thing with a different numerical result.

This is easily seen in some of the basic examples. If you use a simple non-informative prior for a simple estimate of a mean and its credible interval, for example, you’ll usually get a formula for the 95% CI that is very similar to the classical formula but with an extra bit, n-weighted. This bit declines to zero with large n, indicating that the two CIs converge in the large sample limit (as they must). This convergence raises problems for the classical contrast between “frequentist” and “Bayesian” CIs. Gelman elegantly explains the contrast and why Bayesian CIs are more intuitive, but to the best of my knowledge he doesn’t have a good explanation for the implication from the large sample limit that the two types of CI are essentially the same. This might also be why everyone in practice treats “frequentist” 95% CIs as if they were Bayesian (i.e. interpreting values near the middle of the CI as more likely in some sense than those near the fringe).
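A minimal sketch of the large-sample convergence, assuming a normal model with known standard deviation and a conjugate normal prior on the mean. This is one simple textbook setup and not necessarily the model the comment has in mind; the observed mean and the prior parameters are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96

def classical_ci(xbar, sigma, n):
    """Classical 95% confidence interval for a mean, sigma known."""
    half_width = Z95 * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

def credible_interval(xbar, sigma, n, prior_mean, prior_sd):
    """95% credible interval under a Normal(prior_mean, prior_sd^2) prior.
    The posterior mean is a precision-weighted average of prior and data."""
    precision = 1 / prior_sd**2 + n / sigma**2
    post_mean = (prior_mean / prior_sd**2 + n * xbar / sigma**2) / precision
    half_width = Z95 / sqrt(precision)
    return post_mean - half_width, post_mean + half_width

def endpoint_gap(n, xbar=0.5, sigma=1.0, prior_mean=0.0, prior_sd=1.0):
    """Largest difference between corresponding endpoints of the two intervals."""
    c_low, c_high = classical_ci(xbar, sigma, n)
    b_low, b_high = credible_interval(xbar, sigma, n, prior_mean, prior_sd)
    return max(abs(c_low - b_low), abs(c_high - b_high))

for n in (10, 100, 10_000):
    print(n, endpoint_gap(n))
```

In this setup the prior contributes a fixed amount of precision while the data contribute an amount proportional to n, so the gap between the two intervals shrinks toward zero as the sample grows.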

The problem I see with Bayesian stats as it is used in practice is that it is usually presented in almost exactly the same way as Frequentist stats – i.e. a credible interval is used to draw conclusions about what it means to be outside the interval. I think this is because scientists don’t have many other ways of coming to conclusions but to accept/reject (hence my dispute with John’s presentation of classical hypothesis testing). It’s a natural way of thinking: “since we gave people this drug their strokes stopped happening, the drop is way too big to be just luck, something’s working.”

So two requests for the “Bayesians” commenting here:

1. please stop using “frequentist” as an insult and making false dichotomies (e.g. mbw’s throwaway “strict frequentist” comment): this isn’t an R message board, there’s no need to be an arsehole;

2. show how you would approach the question of whether a drug is effective in a way that is fundamentally different to a “frequentist” approach. i.e. don’t tell me that your CI would be slightly different, but tell me how you would use that CI in a way that is fundamentally different to a classical drug trial.

I think you can’t!
93 hellblazer 09.04.15 at 2:48 am: Delurking to ask: does Cosma Shalizi still read or contribute to Crooked Timber?
Cf. this old blogpost, which has among other things a link to his paper with Gelman (apologies if this has been brought up implicitly upthread).
94 TM 09.04.15 at 2:52 am: Does it really need to be spelt out that Bayes theorem is a standard result of “classical” (sensu 88 for want of a better term) probability theory and there is nothing the least bit inconsistent between Bayes theorem and frequentism? Formally you can easily express it in frequentist terms. In fact, in the examples about HIV prevalence for example, what you are talking about ARE frequencies (there is a definite number of HIV positive people within a certain population – that is a frequency; and so on). The suggestion that frequentism “overlooks” conditional probabilities is pure nonsense as Dogen and christian are correct to point out. The real issue as far as I can see is in statements such as the one quoted in last paragraph of 80, which most empirical scientists won’t swallow.

And since Nate Silver has been mentioned several times, remember this much-debunked example of “Bayesian analysis” in action from fivethirtyeight:
http://fivethirtyeight.com/features/a-formula-for-decoding-health-news/
(It seems that the comments were removed but see https://ksj.mit.edu/tracker/2014/03/nate-silvers-new-fivethirtyeight-dishes/ or http://www.cjr.org/the_observatory/fivethirtyeights_disappointing.php).
95 adam.smith 09.04.15 at 6:52 am: very good post by faustusnote @92. Otherwise, just joining the choir in saying that of course every frequentist statistician knows and uses Bayes’s theorem.
If you think that’s not the case, you probably shouldn’t engage in the “frequentist vs. bayesian” debate at all because you’re not even understanding it on a superficial level.
96 John Quiggin 09.04.15 at 9:12 am: @TM Classical probability theory (of which Bayes theorem is part) is not the same as classical statistical theory (hypothesis testing, confidence intervals, the frequentist definition of probability and so on).
97 faustusnotes 09.04.15 at 9:33 am: John, classical statistical theory is built on classical probability theory and Bayes theorem is taught at the very beginning of it. No two things are the same, but what TM said is absolutely correct. The false positive probability example is taught in every introductory stats course, and all introductory stats courses are classical frequentist.
98 Hidari 09.04.15 at 10:17 am: @96
John
you are of course absolutely correct. However I think the problem is not that faustusnotes, TM etc know too little. It’s probably because they know too much. They (presumably) are experts in statistics/probability theory etc. and they assume that everyone else is too.* But the OP was about statistical testing and the replicability problem in psychology. I have a great deal of knowledge of what psychologists are taught in their ‘statistics course’ (at undergraduate level) and I can absolutely assure everyone right now that Bayes theory is not taught at any point (to the best of my knowledge, this is true of essentially every ‘Western’ university, although of course I could be wrong about this).

If anyone cares I have been taught a course in statistical theory at post-graduate level at a Russell Group university and Bayes theory was not mentioned once, nor was it alluded to.

*In other words, when they talk about ‘statistics courses’ they are talking about someone studying applied statistics as a major, at undergraduate level, where what they are saying, I am sure, is true. They don’t know what is taught on a ‘statistics course’ as a module or whatever on a psychology or sociology undergraduate degree.
99 faustusnotes 09.04.15 at 10:33 am: Actually hidari, one of my problems with criticisms of “frequentists” by “Bayesians” is that until recently it was basically impossible to do Bayesian stats in any practical setting due to its analytical difficulty, and its computational impossibility. Part of the reason that classical theory has the stranglehold it does is its ease of application – it’s practical to do and easy to report to policy makers. Until recently MCMC stuff was really beyond most ordinary computers and it still is in practice beyond most ordinary users of stats. So to read Bayesians criticizing Frequentist stats as “wrong” when in fact they’re the only stats that have been possible for all but the simplest tasks over the past 100 years is … well, it’s frustrating.

For the record I have a Masters in Stats where I didn’t study Bayesian theory (I am self taught in that aspect), but I actually encountered Bayes’ theorem in an epidemiology course first – it was taught in the intro to stats class. I agree it is not taught in psychology and I have worked with psychologists a lot, and none know it. But I think this is because of the problems of doing Bayesian stats, which historically have forced practical courses to focus on classical stuff and produced a lot of institutional momentum in that area.

In my opinion stats is primarily used as a tool by people with little mathematical rigor, and we need to recognize that, respect it, and teach accordingly. Teaching Bayesian stats in this context is hard, and sneering at (or laughing at, or just generally subtly belittling) the large group of people who are forced to use stats practically is completely unproductive. If we want Bayesian stats to be used to solve some of the replication problems identified in the OP (I don’t think it will, as I have said, but …) we need to find ways to a) make it accessible to people with limited mathematical background and b) make its results palatable to policy makers and clinicians.
100 mbw 09.04.15 at 1:58 pm: @faustus #92, adam.smith #95. Yes of course every frequentist uses Bayes thm. Instead of rapidly typing “use Bayes” I should have spelled out “use Bayesian priors”. BTW, if you’ll check #82 you’ll see that I’m not exactly a hardcore Bayes-only type.

Now to the specifics re #92:

My point was not that with very large N you can’t exclude false hypotheses, e.g. that aspirin is 100% safe, by any standard statistical technique. It’s that the standard prior (zero effect) is generally known to be false ahead of time. So going through the exercise of rejecting it (with huge N) or failing to reject it (with smaller N) simply doesn’t correspond to the actual problem at hand. This is not exactly a new point, it’s what Deming and many others went on about.

You say test against the 2/3 effective null. Ok, but in all the actual papers the 0% effective null was used. Until there was an extremely clever natural-experiment paper with N of about 3,000,000, nobody was in position to reject any nulls. So you would have gotten the opposite answer to the standard literature because you picked the opposite null.
Look, to pick an old cliche, my best friend is (literally) a frequentist. But when I write of rigid frequentists I mean those who feel compelled to pick null, do HT, give p-values, even when that’s totally inappropriate for the problem at hand.

You want an example of how things would be done differently, not just at the edges. Ok, real world example. Say that a company has a vax for (IIRC) meningitis. In a big phase 3 trial on kids. Working well, as it did in small trials. Do you interrupt the trial early to start giving the vax widely and save dozens of lives. Or do you wait til the trial is over, as (no surprise) a rival company urges the FDA. Your BIL is called in as an expert to help decide. He tells you about subtleties of the statistics of trial interruption. You ask “But what if we took some reasonable prior distribution on the effectiveness, rather than pretending we thought it was probably zero?” He does a double take, says “Oh, you mean does it really work? Yes, we know it works. But the FDA uses p-values.” And no, I’m not a fiction writer.
101 TM 09.04.15 at 1:59 pm: I was of course going to say the same as 97. Recall that I said there was *no contradiction* between Bayes theorem and frequentist statistics. There is no need to choose between those.

I have talked a bit about what I perceive as shortcomings of current statistics education. I don’t see more Bayesianism as the solution. I agree with fn (and perhaps Hidari) that especially non-statisticians have a hard time with Bayesian formalism. Frequentism is more intuitive and easier to understand and that is not least true for conditional probabilities, which I believe are best understood with a Venn diagram. Take the HIV incidence example. You have a population and so and so many members are HIV positive, so and so many are IV heroin users, and so and so many among the heroin users are HIV positive. This typical setting is easiest to analyze in terms of frequencies. HIV incidence in each population is a frequency and conditional probability is just another frequency. I think that students are very unlikely to make any sense of the Bayesian formalism unless they first understand how it works with frequencies (draw a Venn diagram or contingency table, fill in the numbers etc.).
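The frequency reading of conditional probability can be written out as a contingency table. This is a minimal sketch with invented counts (the population size and the three group counts below are hypothetical, chosen only to make the arithmetic easy); it shows that counting cells and applying Bayes' theorem give the same answer.

```python
from fractions import Fraction

# Hypothetical counts for a contingency table (not real epidemiological data).
population = 100_000
heroin_users = 1_000          # IV heroin users in the population
hiv_positive = 500            # HIV-positive people in the population
both = 100                    # heroin users who are also HIV positive

# Conditional probabilities read directly off the table as frequencies.
p_hiv_given_heroin = Fraction(both, heroin_users)
p_heroin_given_hiv_by_counting = Fraction(both, hiv_positive)

# The same quantity via Bayes' theorem, using only marginal frequencies.
p_heroin = Fraction(heroin_users, population)
p_hiv = Fraction(hiv_positive, population)
p_heroin_given_hiv_by_bayes = p_hiv_given_heroin * p_heroin / p_hiv

print(p_hiv_given_heroin)               # share of heroin users who are HIV positive
print(p_heroin_given_hiv_by_counting)   # share of HIV-positive people who use heroin
print(p_heroin_given_hiv_by_bayes)      # identical, computed with Bayes' theorem
```

Exact fractions are used so that the two routes can be compared for equality rather than to within rounding error.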

Furthermore, I’m very skeptical that teaching psychology students to interpret probabilities as “subjective degrees of certainty” will make them better scientists. Also, in response to 98, how exactly do you think that the replication problem could have been avoided if psychologists had a better grasp of Bayesianism?
102 TM 09.04.15 at 2:04 pm: 100: “when I write of rigid frequentists I mean those who feel compelled to pick null, do HT, give p-values, even when that’s totally inappropriate for the problem at hand.”

If your definition of “frequentist” is “somebody who does inappropriate statistics”, I guess you have proven your point. Or seriously, let’s examine the logic here:
– Frequentism is the prevalent statistical paradigm.
– There are many mistakes made in statistical analyses.
– Therefore, frequentism is to blame. QED
103 TM 09.04.15 at 2:10 pm: 100: Isn’t the purpose of a phase 3 trial in part to determine adverse effects? What is your prior about adverse effects?

Also this. FDA procedures are there to ensure that drugs are safe and effective, and also to convince the public that drugs are safe and effective. It seems to me that statistical formalism isn’t all that is at issue in the example.
104 mbw 09.04.15 at 2:43 pm: @TM Good point about my inappropriate use of the phrase “rigid frequentists” to describe those who try to apply an HT algorithm where it doesn’t belong. My point is not that intelligent frequentists would do that, just that current stats education leads many people to do that.
The vax point is that if you’ve got one that worked in some animal models, worked on one age group, and seemed to work (big uncertainty) in small trials on the group in question, does it really make sense to pretend that the point of the big trial (I may not remember names precisely) is to test the hypothesis that it doesn’t work at all? Setting something like a CI employing both the new data and the modestly informative prior data is what you’re trying to do. And remember, behind those dry words is the question of many months of real-world use or non-use, with dozens of deaths and brain damages at stake. I do understand that those may be costs that need to be paid to maintain confidence that only reliable treatments are being used. And of course in many cases reasonable Bayesian priors would lead to requiring stronger rather than weaker new evidence (compared to HT) before accepting a new treatment.

The questions about how to teach and how to convey information to the public etc. are very important ones. I’m not confident about the answers. The current system seems ineffective, with some very reproducible failure patterns. Maybe that’s partly because it’s sort of jerry-rigged together out of tools that were especially useful 100 years ago. Whether a straight Bayesian approach would work, I don’t know, but I believe that Downey at Olin College may be trying something along those lines. It would be interesting to hear how it’s going.
105 mbw 09.04.15 at 2:51 pm: @TM #101 . On teaching screening stats. My wife is probably the best teacher of intro (frequentist) stats in the world. On introducing the HIV screening example, she has a good idea of what works on real students (N>10,000). And you’re absolutely right, it has to be done with numbers first. Then you can try to do the more formal P(A|B) stuff, but almost no one can actually learn it that way first.
But this is an issue of formalism vs. concreteness, not of p-values vs. distributions.
106 faustusnotes 09.04.15 at 3:08 pm: mbw, your examples all seem to boil down to this contrast.

– frequentist would assume a likelihood, calculate a confidence interval, compare to a threshold, and reject/retain
– bayesian would assume a likelihood, assume a prior, calculate a credible interval, compare to a threshold, and reject/retain

Either method can tweak CI through likelihood; either can choose the wrong threshold. But they have fundamentally the same decision making process. I asked you to explain to me how they are different decision making processes, not to quibble about how to calculate the CI.

Your final example seems to indicate disappointment that the FDA doesn’t allow drug companies to choose a prior. Can you see any reasons for why it might be a bad idea to let a drug company, with millions riding on marketing a product that will cost the health service millions, choose a prior that assumes the drug is effective? Is it possible that the FDA prefers p-values because they want to force all drug companies to work from a null position?

Note also that Bayesian stats doesn’t solve the power problem at the heart of John’s post. You can calculate a probability of an event or outcome on the basis of a bayesian CI, but we know that if you increase n then you get a smaller probability of certain events (usually the rejection events; see above). So you haven’t solved the problem of small n leading to failed tests. Sure your prior might change the n threshold, but when you don’t know the truth the n threshold is unknown anyway – how much have you gained?
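The dependence of the rejection probability on n can be made concrete with the numbers from the original post: a true shift of 0.1 population standard deviations, n = 225, and a threshold of two standard errors. This is a rough sketch assuming a one-sided test and a normally distributed sample mean; the function name `power` and the comparison sample size of 900 are illustrative choices.

```python
import math
from statistics import NormalDist


def power(effect, n, z_crit=2.0):
    """Probability that the sample mean exceeds z_crit standard errors,
    given a true shift of `effect` population standard deviations."""
    se = 1 / math.sqrt(n)                 # standard error with population SD = 1
    return 1 - NormalDist().cdf(z_crit - effect / se)


power_225 = power(0.1, 225)   # the sample size used in the post
power_900 = power(0.1, 900)   # four times the sample, half the standard error
print(round(power_225, 3), round(power_900, 3))
```

With n = 225 the true effect is only 1.5 standard errors from zero, so the test detects it in roughly 30 per cent of samples, matching the figure in the post; quadrupling n raises that substantially.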

And none of this gets out of the fact that I have repeatedly pointed out here: replicability and poor results in psychology are much more determined by study design, measurement method, sample selection and experimental process than the final statistical method. Something John didn’t talk about in the OP, and something that destroys your stats no matter what dichotomy you choose to throw at them.
107 TM 09.04.15 at 4:07 pm: This has been one of the most constructive CT threads I can remember. Keep it coming!
108 mbw 09.04.15 at 4:20 pm: @106 No, I don’t want drug companies to pick the priors. Duh. In the example I cited, the two drug companies would have picked wildly different priors. Now that you raise the question, I’d want the FDA panel to pick the prior.
On the flip side of the same question, currently polluters are allowed to pick the prior (zero harm) under the HT system. So if you’ve got another chlorinated hydrocarbon, just like the last strong carcinogen only with a methyl group tacked on somewhere, until somebody has p<0.05 (the sacred number) on your particular compound, then it hasn't been "shown scientifically to be harmful". Etc. Cuts both ways.
109 mbw 09.04.15 at 4:26 pm: more @106

I think the conventional frequentist might
1. pick a cutoff and threshold
2. compute a likelihood
3. compare to cutoff and threshold
4. make yes/no decision.

A conventional Bayesian would
1. pick a prior
2. compute a likelihood
3. compute a posterior distribution
4. compute expected cost/benefit from an integral over the whole distribution
5. make yes/no decision

Sometimes these come out about the same, sometimes not.
110 mbw 09.04.15 at 4:34 pm: @106 On decision making process. For flu vax for elderly, there were two different standard nulls. One was the one used in the literature, the other was assumed obvious in some of the comments above. Frequentists following the procedure you outline would make opposite decisions depending on the choice of null. Bayesians with a smooth prior wouldn’t have to make that arbitrary choice and could have come to the correct decision with less tsuris.
111 mbw 09.04.15 at 4:49 pm: @99 On computational complexity. Sander Greenland argues that one can easily use standard frequentist software packages to do Bayesian calculations.
112 adam.smith 09.04.15 at 4:55 pm: Gelman actually just posted on this, particularly the role of priors: http://andrewgelman.com/2015/09/04/p-values-and-statistical-practice-2/
good stuff. I think his recurring theme that testing against a zero-effect null in the social sciences is misleading and that a Bayesian approach is more likely to produce relevant/interesting results is quite good. It also shows, though, that incompetent/schematic Bayesian approaches would just as easily lead to bad statistics.

Apart from the “bad design” point that faustusnote makes, I’m not convinced that Bayesian approaches are really more robust to application by people who don’t really understand them. It’s just that that’s currently less common, because people with only basic stats training all do frequentist stuff.
113 faustusnotes 09.04.15 at 4:57 pm: mbw, in almost any real life statistical problem, step 4 of your bayesian approach also applies to the frequentist approach. Almost every likelihood maximization problem of note involves “an integral over the whole distribution”. That should be step 3 in your frequentist process. So then we have for the frequentist:

1. pick a cutoff and threshold
2. compute a likelihood
3. compute an integral over the whole distribution [except in trivial cases]
4. compare to cut/off and threshold
5. make yes/no decision

for the bayesian
1. pick a cutoff and threshold
2. pick a prior
3. compute a posterior distribution [this is actually your integral]
4. compare to cut/off and threshold
5. make yes/no decision

You’re consistently avoiding talking explicitly about steps 4 and 5 in the Bayesian framework. Consider: is mortality higher amongst indigenous Australians than non-indigenous Australians? The decision-making process to answer this question is the same in both methods: pick a definition of “higher”, calculate your preferred representative probability distribution, compare to the definition.

What is your definition of “higher”? 2x? 3x? These are just numbers. The key point is that you will calculate a number, a range for that number, and compare. If you don’t do this, then you have to either a) present a radically different way of determining what is “higher” or b) relegate Bayesian stats to a complex kind of descriptive role, which is boring and irrelevant.

Your choice.

Some people think that you can use Bayesian stats to assign probabilities to each possible case (e.g. my posterior distribution tells me that there’s a 60% chance that Indigenous people have higher mortality). Two problems with this: 1) you still have that sneaky word “higher” in there, so you are still comparing to a threshold, i.e. essentially no different to frequentism; 2) policy makers don’t give a Bayesian rat’s arse for probability distributions, they want decisions. Is it higher or not? I gave you 10,000 bucks for a nice computer so you could calculate the best answer, you give me probabilities? What is it Gandalf said? “Don’t go to the Bayesians for advice, as they will give you both yes and no for an answer.”

Which brings me back to my previous point: can Bayesian statisticians actually distinguish their decision making process from frequentist, or are they just proposing a different type of CI? And can they provide a way to make their supposedly “better” methods accessible to the people who decide health, welfare and drug licensing policy? If not, why should people who engage in those policy areas use these methods for anything except basic descriptive statistics?
114 faustusnotes 09.04.15 at 5:01 pm: mbw @ 110: how would frequentists come to a decision? Against what criterion? How is it different? You consistently elide the actual decision making process in Bayesian stats. You suggest they are calculating a different number, fine, but how are they using that different number?

Also, lots of statistical packages either have only recently introduced Bayesian stats (Stata), need languages that are not at all common in the standard practical fields (python in SPSS) or are inaccessible (e.g. JAGS in R) to all but the most advanced users. Computational complexity remains a problem. I’m running problems in BUGS that take weeks (don’t even get me started on BUGS!) Anyone who thinks computational complexity is not a problem is ignoring the basic needs of basic users, and taking an elitist position.
115 Hidari 09.04.15 at 5:23 pm: @99 and @114
Yes you are absolutely correct. Although to be fair, as you note in the second paragraph of comment 114, things are changing now, although far too slowly.

Incidentally in case you think I’m acting all ‘high and mighty’, I’ve published things with p-values myself, albeit written with gritted teeth, just for pragmatic reasons.
116 TM 09.04.15 at 6:02 pm: 110: The point of your flu vax example somewhat eludes me, probably because I am missing context. What decision was actually at stake and how is it that the choice of null would cause opposite decisions? If they tested the zero null, they would conclude that the vaccine is effective in elderly people. If they tested the 66% null, they would conclude that the vaccine is less effective in elderly people than in the general population. Both of these conclusions are correct and there is no contradiction. What am I missing?
117 Hidari 09.04.15 at 6:16 pm: Incidentally, pace what I said in 115, has anyone checked out this?

https://jasp-stats.org

It markets itself as a Bayesian SPSS for psychologists.

http://www.psychologicalscience.org/index.php/publications/observer/2015/march-15/bayes-or-bust-with-new-softwares.html
118 TM 09.04.15 at 6:51 pm: An egregious example of misused hypothesis testing is the media claim “There has been no global warming since 1995” because the warming trend wasn’t statistically significant. The overall trend has been robust and whether or not an arbitrarily selected time period was too short to produce a significant result was irrelevant.

This is a case where the choice of the null hypothesis does indeed matter and has been used to manipulate the media. The abuse would be obvious to anybody who understands basic statistics and is best explained by showing the confidence intervals.

See https://tamino.wordpress.com/2014/12/04/a-pause-or-not-a-pause-that-is-the-question/
119 mbw 09.04.15 at 7:21 pm: @TM 116 Until the N=3,000,000 natural experiment was able to (barely) reject the zero-effect null, there was a prolonged period in which no data would reject either null. What actually happened was:
1. For a long time, the conventional wisdom was that flu vax drastically reduced mortality, and thus good practice was to give vax. That was based purely on correlation, an error that would be made by neither a good frequentist nor a good Bayesian, both of which are scarce.
2. People pointed out that the effect that had been described was insanely too large to be causal, since the alleged number of lives saved exceeded total flu mortality. So they proposed more serious tests of the null, which was taken to be the zero-effect one. Within the large error bars of those tests, the null was not rejected. So these authors advocated stopping the vax, since there were no grounds to believe in efficacy.
3. Finally (?) the huge natural experiment study showed (using p-values) that the zero-effect null could be rejected. The best estimate is that ~50% of flu-caused mortality is prevented. So now best practice is to give vax.

BTW, where did the huge correlations come from? Nobody vaccinates somebody obviously about to die.
120 mbw 09.04.15 at 7:27 pm: @faustus 113. Sorry, I’m being too compressed. The Bayesian method I describe is absolutely not what you took it to be. The integral is of utility (either positive or negative) times probability density. That’s after the step of just multiplying priors by likelihood, not another version of applying a cutoff. The subsequent “cutoff” is then trivial: is net expected utility positive or negative?
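The decision rule described here (multiply prior by likelihood, then integrate utility against the posterior) can be sketched on a grid. Everything numeric below is a hypothetical illustration, not data from any trial: the flat prior, the "18 protected out of 20" likelihood, and the linear utility with a fixed cost are all assumptions made for the example.

```python
def posterior_on_grid(grid, prior, likelihood):
    """Normalised posterior weights: prior times likelihood at each grid point."""
    unnorm = [prior(t) * likelihood(t) for t in grid]
    total = sum(unnorm)
    return [w / total for w in unnorm]


def expected_utility(grid, weights, utility):
    """Discrete version of the integral of utility times posterior density."""
    return sum(w * utility(t) for t, w in zip(grid, weights))


# Efficacy values from 0 to 1 in steps of 0.005.
grid = [i / 200 for i in range(201)]

prior = lambda e: 1.0                           # hypothetical flat prior
likelihood = lambda e: e ** 18 * (1 - e) ** 2   # hypothetical: 18 of 20 protected
utility = lambda e: 100 * e - 30                # hypothetical benefit minus fixed cost

weights = posterior_on_grid(grid, prior, likelihood)
eu = expected_utility(grid, weights, utility)
decision = eu > 0                               # act if net expected utility is positive
print(round(eu, 2), decision)
```

The only threshold in this sketch is the sign of the expected utility; the value judgments sit entirely in `utility`, separate from the probability calculation.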

So these really are very different approaches, so much so that one of the Bayesian steps is unfamiliar to you.
121 mbw 09.04.15 at 8:20 pm: @faustus #113. Your indigenous mortality question is bewildering to me. There’s no need for any decision because no action is proposed. It’s purely descriptive. At that level, there’s no reason to compress the description to a single bit.
If some “treatment” were proposed, then we’d have to go through the exercise of estimating the causal effect of the treatment. And that raises an issue that’s often more problematic in practice than Bayes vs. freq: disentangling causal effects from confounding, even for huge N.
122 js. 09.05.15 at 12:52 am: This is a great thread. I don’t know enough to make a meaningful contribution, but I know _just about enough_ to have learned a lot. My favorite kind of CT thread, pretty much. Thanks!
123 faustusnotes 09.05.15 at 2:10 am: mbw, utility calculations you describe don’t seem to have anything to do with Bayesian statistics but are just a thing from your field, I think. Certainly in BDA when he does regression Gelman stops at the posterior distribution of the coefficients, he doesn’t integrate over anything; and it isn’t mentioned anywhere in the BUGS book (which is the only text I have to hand in my house on a Saturday morning). I think this is a tool of your field and not to do with Bayesian stats – or are you suggesting you can’t calculate expectations under a Frequentist framework…? In any case, given this (and assuming I understand what you’re talking about and haven’t been led astray by my reading of BDA), what your method boils down to is this:

1. assume a prior and a likelihood
2. calculate a test statistic (through a utility integration on a posterior distribution + utility measure)
3. compare to a threshold

How is this process in any meaningful philosophical way different to the frequentist framework that John is rejecting? In health we don’t use utilities, we just compare directly, but so what? In either case the big problem will remain your choice of threshold. Up above you were complaining about the use of a 2/3 threshold vs. a 0 threshold – you can make the same mistake if you mispose the question, surely?
124 faustusnotes 09.05.15 at 2:14 am: mbw @121: I think we’re talking at cross purposes from different fields. In health, the presentation of a difference in outcomes between two groups, and the conclusion it was not by chance, is the information that the policy-maker is looking for (since we don’t use utility calculations). so the two policy questions here (to flu vax or not flu vax; to increase investment in indigenous health or not) are both being answered in the same way.

in social epidemiology (such as is being deployed in the Indigenous example) the “treatment” is going to be very challenging to evaluate and will largely be a political decision. If we do evaluate it, this job will be undermined by the experimental design and sampling process long before we get to the stats.

I think my example has been undermined by the different policy processes at work in our fields, maybe.
125 faustusnotes 09.05.15 at 2:20 am: Sorry for the multiple comments, I want to respond to three points but they’re too complex to put in one post.

TM@118 gives an example of frequentist stats getting the wrong conclusion (the global warming “pause”) but this is an example of frequentist stats done badly within their own framework, and no statistician has ever taken it seriously (and Tamino has never stepped outside frequentist stats in taking it down). What we’re interested in here is not the terrible consequences of letting a denialist have access to R, but the terrible consequences of having an actual scientist do the stats correctly.

I think a better example from global warming is Nic Lewis’s egregiously wrong estimation of equilibrium climate sensitivity. He used Bayesian methods to get a value of ECS that is way too low and physically impossible given the paleo record, but he used the method correctly and got published (twice) in reputable journals because of this. He just very carefully chose and justified an “objective” prior that just happens to get the result he and his buddies at the GWPF are looking for. I don’t think he could have got the result from a frequentist analysis done correctly within its own terms.

Nic Lewis’s paper is an example of why policy-makers will always be leery of Bayesian techniques. You can imagine the conversation in an episode of Yes, Minister:

“wait, so you got the data, you assumed that the intervention would work, and after you crunched the data on that assumption you showed the intervention worked? I think we’ll be finding a lot more uses for this ‘Bayesian analysis’ of yours, Humphrey…”
126 John Quiggin 09.05.15 at 6:10 am: The implicit question for classical statistical theory is
“Should novel hypothesis X be accepted as true?”
This implies the need for an objective social convention, such as the p<0.05 rule, and a bias in favor of rejection (since most novel hypotheses are false, and since we know that most other aspects of the research procedure will favor acceptance).

The natural question for Bayesians is
“What choice should decisionmaker Y make in this situation”.
This suggests a role for subjective beliefs and preferences. Bayesian decision theory shows how this can be used in an expected utility framework.

Unfortunately, problems of collective decision sit uncomfortably between the two.
127 mbw 09.05.15 at 3:00 pm: @ 123. You had asked about how a Bayesian would make a decision about what to do. No formal branch of math says anything about values, so one always has to put those in somewhere. Frequentists, at least in your description, insert them crudely into some choice of cutoffs, mixed in with the probability calculations. A Bayesian can go straight through to calculate a pdf. Then to make a decision you can take that to calculate expected utility of an action. Again, the assignment of value is always outside the calculation, and keeping the whole pdf as the output of the pure stats allows that separation to be explicit and simple.
128 mbw 09.05.15 at 3:37 pm: @ 124 I’m surprised you say “In health, the presentation of a difference in outcomes between two groups, and the conclusion it was not by chance, is the information that the policy-maker is looking for (since we don’t use utility calculations)”. Think of the voluminous recent arguments about mammography and PSA screening. It’s partly about expectation values of treatment effects and largely about assigning utility to various outcomes (death, loss of a breast, incontinence,…). This side of the question appears in almost every discussion. I’m very close to people in the medical stats field and they describe utility evaluations (including simple $ cost) as central to many decisions.

BTW, you made some comment on my “field”, but I don’t know what that was presumed to be. I’m a retired physicist, whose only claim to special credibility in this discussion is that I’ve never had a stats course.
129 mbw 09.05.15 at 4:03 pm: couple of technical points:
@adam.smith #112. Unfortunately, in that particular line of argument Gelman has invented something he calls a “p-value” that doesn’t have the basic definitional property of p-values. In the limit of many replications, it does not have a uniform distribution, in many cases not even close. So e.g. a Gelman “p” of 0.01 doesn’t mean what everyone would think it would mean, that deviations from the null that large or larger would be found in 1% of many repeated experiments. In effect, Gelman has changed the definition of p-value to help discourage people from following up on too many shaky leads. Maybe a worthy goal, but the Bayesian Greenland seems thoroughly fed up with the verbal trick.
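The definitional property referred to here (a p-value is uniformly distributed when the null is exactly true) can be checked with a small simulation. This is only a sketch under assumptions of my own: a two-sided z-test, and an arbitrary seed and experiment count. None of the names below come from the thread.

```python
import random
from statistics import NormalDist

def simulate_null_p_values(n_experiments=20000, seed=1):
    """Two-sided z-test p-values when the null hypothesis is exactly true."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_experiments):
        z = rng.gauss(0.0, 1.0)  # test statistic drawn under the null
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values

p_values = simulate_null_p_values()
# For a genuine p-value these fractions should sit close to 0.01 and 0.05.
frac_below_01 = sum(p <= 0.01 for p in p_values) / len(p_values)
frac_below_05 = sum(p <= 0.05 for p in p_values) / len(p_values)
print(frac_below_01, frac_below_05)
```

A statistic that fails this check under the null does not carry the "1% of repeated experiments" reading, whatever it is called.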

@ faustus # 92 “CIs and t-tests are formally equivalent”. No, only in some very special cases where you happen to have a t-distribution. If you know the population is Gaussian and you know sigma, you want a normal z-test. If you know the population is Gaussian but you don’t know sigma, then with a few caveats you can use t. If you don’t know the population is normal, you’ve gotta use some other distribution. If you don’t have a clue about the underlying population distribution, maybe you have to go non-parametric.

Unfortunately, elementary stats courses often teach people to routinely push the t-test button. That barely improves generality over normal z-tests, at the huge cost of taking away the intuition that students can develop about the normal curve, as opposed to the family of t-distributions.
130 adam.smith 09.05.15 at 5:47 pm: @mbw –

@adam.smith #112. Unfortunately, in that particular line of argument Gelman has invented something he calls a “p-value” that doesn’t have the basic definitional property of p-values.

well, that’s the whole point of the article (by, in case that wasn’t clear, one of the world’s leading applied Bayesians): the definitional properties of p values make them useless to how people think. When is (or isn’t) interpreting p values as something different (i.e. one-sided p values as approximations of directional posterior probabilities, as Greenland&Poole suggest) useful? But it does also have the implication that running Bayesian analysis mindlessly with non-informative priors

If you don’t know the population is normal, you’ve gotta use some other distribution.

no, that’s too strong. The t-statistic refers to the sampling distribution of the test statistic, not the population distribution, and because of CLT, T-tests are perfectly fine for random samples from any number of distributions.
131 mbw 09.05.15 at 6:28 pm: @130 Yes, I know who Gelman is. The dispute as I see it is over whether the frequentists should be allowed to use a mathematically well-defined term (p-values) even if they often make bad decisions with it. Greenland and others may not be big fans of frequentist approaches but they do want to keep the mathematical language clean, including the term p-value.

As for t-tests, by the time the CLT has kicked in, you have a good estimate of sigma so t doesn’t change things much. Yes, there are cases where it’s better than z, but the emphasis on t in teaching is counterproductive. And unfortunately (as we saw above where a commenter thought it was formally equivalent to general CI’s) it’s often used for cases where the CLT has not come close to kicking in, in which case it gives false confidence.
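The "false confidence" point can be illustrated by simulating how often a nominal 95% t-interval actually covers the true mean when the sample is small and the population is skewed. This is a sketch under my own assumptions: an exponential population with mean 1, samples of size 10, and the tabulated t critical value for 9 degrees of freedom hard-coded.

```python
import math
import random

T_CRIT_DF9 = 2.262  # two-sided 95% critical value of Student's t, 9 degrees of freedom
SAMPLE_SIZE = 10    # must stay at 10 for the critical value above to apply

def t_interval_coverage(trials=20000, seed=2):
    """Fraction of nominal 95% t-intervals that contain the true mean."""
    rng = random.Random(seed)
    true_mean = 1.0  # mean of an exponential distribution with rate 1
    hits = 0
    for _ in range(trials):
        sample = [rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
        m = sum(sample) / SAMPLE_SIZE
        s = math.sqrt(sum((x - m) ** 2 for x in sample) / (SAMPLE_SIZE - 1))
        half_width = T_CRIT_DF9 * s / math.sqrt(SAMPLE_SIZE)
        if abs(m - true_mean) <= half_width:
            hits += 1
    return hits / trials

coverage = t_interval_coverage()
print(coverage)  # noticeably below the nominal 0.95
```

With a Gaussian population the same interval would cover at about 95%; the shortfall here comes from the skewed population, not from the choice of t over z.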
132 Abbe Faria 09.05.15 at 6:31 pm: “How is this process in any meaningful philosophical way different to the frequentist framework that John is rejecting? In health we don’t use utilities, we just compare directly, but so what?”

Well it’s basically just a massively disingenuous cheat isn’t it? With a Frequentist result you can say “most published research results are wrong”, *if and only if* you’re a Frequentist, because then you have a concept of Type I and Type II errors and a concept of coverage. And you actually think things can be wrong.

I don’t know how a Bayesian can possibly appropriate that criticism if they believe you only think in terms of degrees of belief. The equivalent honest Bayesian objection is basically just the completely uncontroversial and pointless “different people have different beliefs, but you can’t be wrong, after all it’s all subjective anyway”. Well, yeah, thanks for that. If they can show the CI can’t be generated from some prior they can add, “your beliefs are irrational”, but Frequentists will still have actual coverage.
133 Bruce Wilder 09.05.15 at 7:56 pm: I’ve been following the back-and-forth of the discussion with interest and pleasure, but — and maybe I am especially dense — I am not really seeing exactly how the classical statistical theory v Bayesian approaches apply in context. JQ says the decision problem falls between the two, though it is unclear to me what that means in relation to the research publishing decisions, or those that follow.

The “collective decision” referenced implicitly in the OP is a presumed chain of semi-independent decisions, beginning with the researchers’ decision to initiate a study and then to distill from the study a paper to submit for publication and a journal editor’s decision to publish, followed by journalists publicizing the finding and commenting on its meaning, teachers and practitioners adding the tidbit to their worldview and altering course in a variety of consequential and inconsequential decisions. (If we were talking about medicine, prescribing behavior might be affected.) And, finally, the ouroboros of science replicating results takes over, and the results are digested.

The OP seems exactly right to me to identify publication bias as the key, and the effect of publication bias predicated on (misunderstanding?) classical statistical theory (combined with a culture celebratory of novelty and protective of privilege — research must be “original” to be published, and should not deprecate previous work by more senior scholars) is to publicize results that overestimate effect sizes. The bar is set high enough by NHT that most published studies find a real enough effect, but NHT tends to distract attention from effect size, and most studies either fail to highlight the practical importance of the estimated effect size, or publicize an effect size, which is a gross over-estimate: on average, something like twice(?) the best estimate of effect size the study’s data could actually justify.
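The "something like twice(?)" guess can be checked against the OP's own numbers (true effect 0.1 standard deviations, sample of 225, cutoff at twice the standard error). The sketch below uses the mean of a truncated normal. It assumes that only positive, significant estimates get published; the function name is mine.

```python
from statistics import NormalDist

def published_effect_inflation(true_effect=0.1, n=225, z_cutoff=2.0):
    """Return (power, ratio of mean published estimate to true effect)."""
    se = 1.0 / n ** 0.5
    std_normal = NormalDist()
    # Publication cutoff measured in standard errors from the true mean.
    a = z_cutoff - true_effect / se
    power = 1 - std_normal.cdf(a)
    # Mean of a normal variable conditional on exceeding the cutoff.
    mean_published = true_effect + se * std_normal.pdf(a) / power
    return power, mean_published / true_effect

power, inflation = published_effect_inflation()
print(round(power, 3), round(inflation, 2))
```

Under these assumptions the published estimates average roughly 1.8 times the true effect, so "about twice" is the right order of magnitude. The ratio grows as power falls.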

Arguments for a conventional test based on Bayesian rationales and procedure presumably rest on focusing more on correctly estimating effect size and opening up consideration of the practical implication of that estimated effect size, by providing avenues for quantitative estimation of practical importance.

Whether a Bayesian procedure could help to remedy the other aspect of publication bias: the tendency to exclude “null results” because null results are falsely presumed to have no information of interest — on that I am less clear. The researcher needs to have a way to make a quantitative case that the research design had sufficient power that a “negative” result is meaningful and arguably important (but what’s the argument?).
134 mbw 09.05.15 at 8:33 pm: @BW 133 Agreed.

@AF 132 Say that those psychology papers had published posterior pdf’s rather than p-values. A Bayesian can look at the same collection of pdf’s, transform them to variables in which they’re uniform. Now look at how the replication results fit on that distribution. Whoops, they’re way clustered over toward the left. You’ve shown there’s systematic bias of some sort. Up to a point, that could be mainly publication bias (especially for results just passing the arbitrary significance cutoff). The reported tendency of surprising results to reproduce less well is just what any Bayesian would expect. When the replication result is really way off in the left-hand side, you say in that individual case there was probably a serious goof. Either the original study overestimated the effect or the replication underestimated it to a surprising degree.
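The check described here (transform to a variable that should be uniform, then see where the replications land) can be sketched with the OP's numbers. Everything specific is an assumption of mine: a flat prior, so the predictive distribution for a replication is normal, centred on the original estimate with twice the sampling variance, and a publication filter that keeps only originals more than two standard errors above zero.

```python
import random
from statistics import NormalDist

def replication_pit_values(true_effect=0.1, n=225, studies=20000, seed=3):
    """Predictive-cdf values of replications, for published originals only."""
    rng = random.Random(seed)
    se = 1.0 / n ** 0.5
    std_normal = NormalDist()
    pits = []
    for _ in range(studies):
        original = rng.gauss(true_effect, se)
        if original / se <= 2.0:
            continue  # not "significant", so never published
        replication = rng.gauss(true_effect, se)
        # Flat-prior predictive for the replication: N(original, 2 * se^2).
        pits.append(std_normal.cdf((replication - original) / (se * 2 ** 0.5)))
    return pits

pits = replication_pit_values()
mean_pit = sum(pits) / len(pits)
share_left_half = sum(p < 0.5 for p in pits) / len(pits)
print(round(mean_pit, 2), round(share_left_half, 2))
```

Without the publication filter these values would be uniform, with mean 0.5. With the filter they pile up on the left, and in this setup that is pure publication bias with no error in any individual study.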

The authors of the Science paper were careful not to get bent out of shape by results crossing from p=0.048 to p=0.052, but sloppy type I/type II thinking does lead to just that sort of confusion.
135 adam.smith 09.05.15 at 9:08 pm: The dispute as I see it is over whether the frequentists should be allowed to use a mathematically well-defined term (p-values) even if they often make bad decisions with it.

no. The dispute is this:
“Sander Greenland and Charles Poole accept that P values are here to stay but recognize that some of their most common interpretations have problems. (…) It is important to go beyond criticism and to understand what information is actually contained in a P value. These authors discuss some connections between P values and Bayesian posterior probabilities.
I [Gelman] am not so optimistic about the practical value of these connections.” (from the Epidemiology article linked to).

It’s a dispute about the usefulness of different Bayesian interpretations of P values, initiated by Greenland&Poole’s suggestions.

As for t-tests, by the time the CLT has kicked in, you have a good estimate of sigma so t doesn’t change things much. Yes, there are cases where it’s better than z, but the emphasis on t in teaching is counterproductive.

that’s a fair point. I’m not sure I’d be happy calling a distribution with an estimated sigma “Z”, but I do agree that most students likely have no clue what they’re doing when using the T distribution and that’s a problem. I took issue with that specific statement, not with your complaint about how T-tests are taught.
136 mbw 09.05.15 at 10:57 pm: @135 Yes, that’s what he writes. I hope you’ll forgive me for saying it reminds me of:

“I say orgies, not because it’s the common term, because it ain’t--obsequies bein’ the common term--but because orgies is the right term.
Obsequies ain’t used in England no more now--it’s gone out. We say orgies now in England. Orgies is better, because it means the thing you’re after more exact. It’s a word that’s made up out’n the Greek orgo, outside, open, abroad; and the Hebrew jeesum, to plant, cover up; hence inter. So, you see, funeral orgies is an open er public funeral.”
137 faustusnotes 09.06.15 at 3:26 am: mbw, before you corrected me to say that the utility function is an extra step in Bayesian analysis, now you’re saying that it’s done in frequentism too. So what you’re saying now is that the decision-making process is exactly the same in both methods, the only difference is how the criterion is calculated? If so, then surely both methods are equally at risk of making the wrong decision? If your argument against frequentist methods boils down to “one should always do the right calculation” then it’s no argument at all (see my point about Nic Lewis’ miscalculations using Bayesian methods).

I don’t think that p-values are different to CIs. p-values are literally simply a transformation of the CI. Instead of comparing the threshold value to the probability of observing it under the given distribution of the outcome value (e.g. sample mean), you transform the test to a standard normal distribution, and calculate the probability under the standard normal distribution, then present it as a p-value. It is literally exactly the same number.
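For the simplest case (a normal z-test with known standard error) the claim that the two presentations lead to the same reject/don't-reject decision can be checked directly. A sketch, with a standard error of 1/15 borrowed from the OP's example and function names of my own:

```python
from statistics import NormalDist

def z_test_summary(estimate, se, level=0.95):
    """Return the confidence interval and two-sided p-value for H0: effect = 0."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + level / 2)
    ci = (estimate - z_crit * se, estimate + z_crit * se)
    p_value = 2 * (1 - std_normal.cdf(abs(estimate) / se))
    return ci, p_value

def ci_excludes_zero(ci):
    return ci[0] > 0 or ci[1] < 0

se = 1 / 15
# Compare the two decision rules over a grid of estimates from -0.40 to 0.40.
agreement = all(
    ci_excludes_zero(z_test_summary(k / 100, se)[0])
    == (z_test_summary(k / 100, se)[1] < 0.05)
    for k in range(-40, 41)
)
print(agreement)
```

The check covers only the binary decision in the known-sigma normal case. It says nothing about the t-versus-z dispute earlier in the thread.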

This is literally how every analysis in the first half of BDA3 and the BUGS book is done:

1. assume prior and likelihood
2. calculate outcome of interest (e.g. mean difference, regression coefficient) and its posterior distribution
3. use the posterior distribution to make a judgement about whether the outcome of interest is 0 or not (e.g. compare credible intervals for a regression coefficient to 0)

This is absolutely 100% the same as a Frequentist decision making process. You even see the same stats in the outcome table (regression coefficient and 95% CI). The only difference is that with a frequentist decision-making process you do an additional transformation of those two things to enable an easy representation of your results against an agreed standard.
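The three steps can be written out for the simplest conjugate case, a normal prior and a normal likelihood with known standard error. This is a sketch only: the estimate of 0.15, the standard error of 1/15 and both priors are hypothetical values chosen for illustration.

```python
from statistics import NormalDist

def normal_posterior(prior_mean, prior_sd, estimate, se):
    """Steps 1 and 2: combine a normal prior with a normal likelihood."""
    prior_prec = 1 / prior_sd ** 2
    data_prec = 1 / se ** 2
    post_var = 1 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * estimate)
    return NormalDist(post_mean, post_var ** 0.5)

def central_interval(dist, level=0.95):
    """Step 3: central credible interval, to be compared with zero."""
    tail = (1 - level) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1 - tail)

estimate, se = 0.15, 1 / 15
vague = central_interval(normal_posterior(0.0, 100.0, estimate, se))
sceptical = central_interval(normal_posterior(0.0, 0.05, estimate, se))
print(vague, sceptical)
```

With the near-flat prior the credible interval is numerically the frequentist 95% interval and excludes zero. With the sceptical prior it straddles zero, so in this sketch the prior moves the result while the step-3 comparison stays the same.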

(I’m not interested in quibbles about t- vs. Z but you can just substitute the necessary words for t in the above statements and get the same meaning).

If step 2. uses a utility function, it doesn’t change anything: e.g. if you want to call “death” a measure of “utility” then whatever, you just use a poisson distribution for your deaths and get a CI that you compare against 0/an accepted standard/an existing treatment. No part of this process changes when you swap the word “credible” for “confidence” in the “CI”. If you integrate over this utility function to get a single value that is above or below zero, all that means is that you’ve made a transformation to express the combination of utility and distribution against an agreed standard.

John is suggesting that if you swap a prior and a likelihood for just a likelihood in step 1, then all the replication problems will go away. This is assigning a degree of magical powers to the prior that it just doesn’t have. You can choose the wrong prior, or a prior (e.g. non-informative – there are lots of examples in the classic texts) that erroneously increases the width of your CIs, or a prior that reflects the wrong knowledge, and you’ll get just as egregious a set of mistakes as if you didn’t use a prior at all. And at stage 3 you’ll still be comparing your outcome (now magically “improved” by your magic prior) against a null hypothesis, you just won’t have put the letter H and the number 0 in front of it, because you’re Bayesian now and you don’t admit that you’re using the same grubby logical process.

No statistical method ever invented, or to be invented, can change the experimental process. All they can do is reduce the sample size required to get the correct answer.
138 faustusnotes 09.06.15 at 3:31 am: Also I think Abbe Faria at 132 makes a good point. Defenders of the value of Bayesian stats to policy-makers and clinicians need to give better explanations of how to justify and defend priors.
139 JimV 09.06.15 at 3:56 am: Apologies if this link has already been mentioned:

http://www.nature.com/news/scientific-method-statistical-errors-1.14700

The chart labeled “Probable Cause” purports to give an example of the extra information a Bayesian analysis of experimental results can give. Yes, it depends on being able to distinguish among cases where an effective result is 19-1 against, even money, or 9-1 in favor, which will be subjective, depending on the amount of prior information.
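The three prior odds mentioned can be pushed through Bayes' rule for a result at p = 0.05. The calibration below is an assumption on my part: it uses the Sellke-Bayarri-Berger bound, -e·p·ln(p), as the smallest Bayes factor in favor of the null, so the outputs are upper limits on the probability of a real effect. Whether the chart uses exactly this calculation is not stated in the thread.

```python
import math

def min_bayes_factor_for_null(p):
    """Lower bound -e * p * ln(p) on the Bayes factor for the null (valid for p < 1/e)."""
    return -math.e * p * math.log(p)

def max_prob_real_effect(prior_prob, p):
    """Largest posterior probability of a real effect consistent with the bound."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds / min_bayes_factor_for_null(p)
    return posterior_odds / (1 + posterior_odds)

results = {
    label: max_prob_real_effect(prior, 0.05)
    for label, prior in [("19-1 against", 0.05), ("even money", 0.5), ("9-1 in favor", 0.9)]
}
for label, prob in results.items():
    print(label, round(prob, 2))
```

The same p-value supports very different conclusions depending on the prior odds, which is the subjective dependence noted above.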
140 mbw 09.06.15 at 4:33 am: @137 I’m out of words. Maybe best to swipe some from Brecht:

The aim of science is not to open the door to infinite wisdom, but to set a limit to infinite error.
141 Bruce Wilder 09.06.15 at 5:43 am: No one likes to admit that we are all just guessing.
142 TM 09.06.15 at 3:01 pm: John 126: The natural question for Bayesians is “What choice should decisionmaker Y make in this situation”.

Question: how does that apply to your original topic, the psychological experiments? I thought these psychologists were trying to increase our scientific understanding of human behavior, not making or proposing decisions.

Whatever the purpose of statistical analysis, cost benefit analysis and policy decision making are outside of its domain (this also in response to mbw). Decision making is based on many inputs, including but certainly not restricted to scientific data and statistical analysis. I strongly disagree with conflating these domains.

I’m also very uncomfortable with incorporating “subjective beliefs and preferences” into statistical procedures. That is precisely the problem with Jeff Leek in 94 and Nic Lewis in 125: statistical formalism is just camouflage for a priori subjective beliefs.

Also, a quibble but an important one, the outcome of a classical hypothesis test is never never never that “hypothesis X is accepted as true”. It’s always rejection or failure to reject H0. This is an inexcusable (though pervasive) mistake and it really matters.
143 TM 09.06.15 at 3:34 pm: mbw 119: Thanks for the clarification. I think the problem boils down to this. In HT, the standard approach is to minimize type I error and the standard but arbitrary threshold is p=.05. The assumption is that it is more important to avoid (i.e. minimize the likelihood of) a type I error than a type II error. But in the flu vax case, a type I error may have been less harmful than a type II error and so relaxing the standard might have been appropriate. This kind of argument is entirely legitimate within the HT framework and can be expressed naturally in the language of HT (type I and type II errors and so on). Of course there is also a legitimate concern about bending the standards ad hoc but nothing in the theory of HT requires practitioners to adopt rigid standards. If excessive rigidity is the problem with classical hypothesis testing, then it’s not a philosophical one but one of custom and practice.

And yes, to talk about the harm caused by type I and type II errors requires some sort of cost benefit analysis. But I contend that this step must be separate from the statistical analysis.
144 mbw 09.06.15 at 4:45 pm: @TM 143 You’ve put your finger on the key question. The standard procedure puts those value judgements into setting the p-value thresholds, all mixed in with the statistical analysis. My point is that because Bayesian analysis doesn’t insist on compressing the result prematurely but outputs a full pdf, one can keep the values and decision making process entirely out of the statistical argument. Then you can come along afterward with a separate alleged utility function, and argue over that without then having to go back and change cutoffs etc. in the statistical part of the argument.
So there are two separate Bayesian pluses.
1. Priors come in explicitly and in a form more representative of our actual knowledge, rather than in a canned special-null form suited only to a few problems. (And there’s no dopey sacred number custom, although in principle frequentists can’t be blamed for that.)
2. Values come in explicitly and entirely separate from statistical analysis when you use the statistical analysis to make decisions. And you use the whole analysis, including relatively unlikely major events.
Thus the relatively objective likelihood analysis stands cleanly in the middle of the process, uncontaminated by values and by the more arguable aspects of the priors.
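The separation argued for here can be sketched as one fixed posterior evaluated under two different utility functions. All numbers are hypothetical: a normal posterior for a treatment effect, a fixed cost of 0.02, and a "harm-averse" utility that weights negative effects twenty times as heavily.

```python
from statistics import NormalDist

def expected_utility(posterior, utility, grid_points=2001, width=6.0):
    """Integrate utility against the posterior pdf with a simple Riemann sum."""
    lo = posterior.mean - width * posterior.stdev
    step = 2 * width * posterior.stdev / (grid_points - 1)
    total = 0.0
    for i in range(grid_points):
        x = lo + i * step
        total += utility(x) * posterior.pdf(x) * step
    return total

posterior = NormalDist(0.10, 0.10)  # hypothetical output of the statistical analysis

def utility_linear(effect):
    return effect - 0.02  # benefit proportional to effect, minus a fixed cost

def utility_harm_averse(effect):
    # Same as above, but harmful (negative) effects count twenty times as much.
    return (effect if effect >= 0 else 20 * effect) - 0.02

eu_linear = expected_utility(posterior, utility_linear)
eu_harm_averse = expected_utility(posterior, utility_harm_averse)
print(round(eu_linear, 3), round(eu_harm_averse, 3))
```

The posterior is computed once; only the utility function changes between the two evaluations, and the sign of the expected utility (act or don't act) flips with it.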
145 bob mcmanus 09.06.15 at 5:17 pm: 141: No one likes to admit that we are all just guessing.

No, not all of us, to the same extent.

While the marks have been refining their “systems” for prediction, the reality-based community have always known that the two-headed coin, marked cards, stacked deck and especially owning and running the “house” have always constituted the bestest priors.
146 Bruce Wilder 09.06.15 at 8:17 pm: Some are better guessers, and you begrudge them this gift? Oh, bob . . . why so resentful? ;-)
147 Ram 09.06.15 at 10:26 pm: The problem in your example is that the original study was underpowered to detect the true effect (30% power to detect a difference of 0.1). Provided a study is adequately powered (80% is the convention in medical research), then a p < 0.05 finding will replicate 80% of the time. And in the event there is no effect, we would expect a failure to replicate 95% of the time. Pretty good.

In other words, hypothesis tests that properly control type I error rates and type II error rates should replicate almost always. That standard practice is to misuse such testing in a way that inflates these error rates is no critique of the appropriate procedures.
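The power figures in this comment can be reproduced for a one-sample z-test with known unit variance. A sketch, assuming a two-sided 5% test (critical value about 1.96 rather than the OP's rounded "twice the standard error"); the function names are mine.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_two_sided(effect, n, alpha=0.05):
    """Power of a two-sided z-test for a mean shift of `effect` (sd = 1)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect * math.sqrt(n)
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)

def sample_size_for_power(effect, power=0.80, alpha=0.05):
    """Smallest n giving the requested power (ignoring the negligible far tail)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_crit + z_power) / effect) ** 2)

power_at_225 = power_two_sided(0.1, 225)
n_for_80 = sample_size_for_power(0.1)
print(round(power_at_225, 2), n_for_80)
```

On these assumptions the OP's design has roughly 32% power, and reaching the 80% convention for a 0.1 standard deviation effect would take about three and a half times the sample.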

One may critique the appropriate procedures by saying they are difficult or impossible to employ in practice, but Bayes doesn't offer any alternative. Doing Bayes correctly may be easier in practice, but it's only easier because it doesn't concern itself with these error rates. Try to find a Bayesian procedure that appropriately controls them and you will find you have a satisfactory frequentist procedure as well.
148 phenomenal cat 09.06.15 at 11:06 pm: “While the marks have been refining their “systems” for prediction, the reality-based community have always known that the two-headed coin, marked cards, stacked deck and especially owning and running the “house” have always constituted the bestest priors.” mcmanus @145

I don’t know, seems to me “the reality-based community” has always and only been playing the most comprehensive game of make-believe. They just have the best tools (or is it toys?) for the game.
149 Bruce Wilder 09.06.15 at 11:12 pm: Ram @ 147: That standard practice is to misuse such testing in a way that inflates these error rates is no critique of the appropriate procedures.

Standard practice is the procedure in context. To excuse standard practice as “misuse” is obdurate.
150 mbw 09.07.15 at 2:45 am: @Ram #147. First, a minor point: it’s true that if the exact null is true then 95% of the time it will not be rejected in a study with those power conventions. However, the exact null forms a set of measure zero in the parameter space, and in most cases there’s no special reason to think it’s exactly true. So then for effects that are too small to be important you don’t actually know what percent will produce “positives”.

More importantly, what you say about replicating positives is just plain mathematically false. For effect sizes that would give (in a measurement that happens to be right on the button) p just less than 0.05, the probability that a replication will again give a positive result is just barely over 50%. When you say that the study has 80% power to detect a positive effect of some size, that size is picked based on some utility criterion. That means that an effect meeting your criterion must be ~3 sigma away from the null to have 80% chance of coming out positive with those cutoffs. So if you actually see a result at say p=0.04 what’s the probability that a replication will also have p<0.05?
There's no frequentist answer to that. A Bayesian can give an answer, but of dubious reliability since it depends on the assumed prior distribution. Another Bayesian might give a different answer.
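The two numbers in this argument (just over 50%, and roughly 3 sigma) follow from the normal distribution. The sketch below measures the true effect in standard errors and assumes a two-sided 5% test. It deliberately does not answer the p=0.04 question, which would need a prior over the true effect.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def replication_probability(z_true, alpha=0.05):
    """Chance that a same-sized replication is significant, given the true effect in SE units."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return (1 - STD_NORMAL.cdf(z_crit - z_true)) + STD_NORMAL.cdf(-z_crit - z_true)

# True effect sitting exactly at the conventional cutoff.
at_threshold = replication_probability(1.96)
# True effect (in standard errors) needed for an 80% chance of significance.
z_for_80 = STD_NORMAL.inv_cdf(0.975) + STD_NORMAL.inv_cdf(0.80)
print(round(at_threshold, 3), round(z_for_80, 2))
```

Plugging an observed z-score in as `z_true` treats the estimate as the truth, which is itself an assumption rather than a frequentist or Bayesian answer.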

In my opinion confusion about the meaning of p-values is the norm, not the exception.
151 John Quiggin 09.07.15 at 4:52 am: TM @142

To take a famous study that failed replication, if you are interested in promoting support for equal marriage, a study claiming that personal testimony from gay canvassers is highly effective will be interesting. If that’s consistent with your prior beliefs, you might switch your budget from media advertising to doorknocking. But if your priors are informed by previous studies showing that personal testimony is usually ineffective, you might choose not to change your strategy.

On the quibble, I disagree. As you say, there’s an important distinction in classical hypothesis testing between failing to reject the null hypothesis and accepting the null hypothesis as true.

But, when you reject H0 in favor of H1 you are accepting H1 as true, just as I said. This can be seen any time you read the one sentence summary of an article in this framework. It invariably states something like “Treatment X causes effect Y, study says”. Not the double negative “Study rejects hypothesis that Treatment X has no effect Y”.
152 faustusnotes 09.07.15 at 5:04 am: That’s an unreasonable quibble, John, since the analysis was conducted fraudulently. An interesting point though, since in order to bend the frequentist method to match their priors and show something that everyone knew was not in the data, they had to commit fraud. Where’s the problem with the frequentist framework there?

Whereas a Bayesian could just prove it was true by selecting a prior.

When you reject H0 in favour of H1 you are not accepting H1 as true.
153 adam.smith 09.07.15 at 5:10 am: If that’s consistent with your prior beliefs, you might switch your budget from media advertising to doorknocking. But if your priors are informed by previous studies showing that personal testimony is usually ineffective, you might choose not to change your strategy.

I think that’s a very bad example. The LaCour/Green paper included a replication of the original study. It also had very robust effects. I.e. you’d have to have used _very_ strong priors in the opposite direction to not get the same effects with Bayesian analysis–priors that a) are not commonly used in Bayesian analysis as practiced and b) were not warranted by the relatively shallow knowledge on canvassing, particularly wrt gay rights, prior to the study.
That’s why almost everyone in polisci familiar with the prior literature (including Bayesian folks like Gelman) found the results credible.

The idea that Bayesian methods can protect you against outright fraud turns them into some kind of magic wand that they’re surely not.

If you want an example where Bayesian priors would likely have helped, the Bem study on ESP is a much better case (although proper frequentist analysis, taking into account multiple comparison issues etc. would also have worked there).
154 Ram 09.07.15 at 2:16 pm: mbw @ 150,

You’ve misinterpreted my comment. Assuming H0, if the original study failed to reject H0, then the study should replicate 95% of the time. Assuming H1, and the effect size the study was powered to detect, if the original study rejected H0, then the study should replicate 80% of the time.

On the other hand, assuming H0, if the study rejected H0, then it will replicate 5% of the time. Assuming H1, and the effect size the study was powered to detect, if the study failed to reject H0, then it will replicate 20% of the time.

What, then, is the overall replication rate? Assuming H0, it is .95 * .95 + .05 * .05= .905, and assuming H1, and the effect size the study was powered to detect, it is .80 * .80 + .20 * .20 = 0.68. If the actual effect size is bigger than what the study was powered to detect, the replication rate is even higher. If it is a lot smaller, we can end up with a low replication rate. Which is just to say that underpowered studies do not replicate particularly well. Which was the point of John’s post. My point is simply that this is why, if you decide to do a hypothesis test, it is important to control both error rates, since these are the principal determinants of replicability. That many people do underpowered studies, or studies with (implicit) multiple comparisons problems, does not invalidate the testing procedures we teach, though it may point to some failures in how we teach it.
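The arithmetic in this paragraph reduces to one formula: if each study rejects H0 with probability q, two independent studies agree with probability q² + (1 − q)². A sketch, with the 30% figure taken from the OP's underpowered example:

```python
def agreement_rate(p_reject):
    """Probability that two independent studies reach the same reject/fail-to-reject verdict."""
    return p_reject ** 2 + (1 - p_reject) ** 2

rate_under_h0 = agreement_rate(0.05)     # null true, 5% type I error rate
rate_at_80_power = agreement_rate(0.80)  # effect present, conventional power
rate_at_30_power = agreement_rate(0.30)  # effect present, power as in the OP's example
print(round(rate_under_h0, 3), round(rate_at_80_power, 2), round(rate_at_30_power, 2))
```

This rate counts two failures to reject as a successful "replication". It is therefore a different quantity from the share of published positive findings that come out positive again, which is simply the power.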

I’m an applied statistician by the way, and I use Bayesian methods to solve all sorts of problems. But people tend to conflate classical statistics with frequentist statistics, and ordinary practice with frequentist recommendations. Frequentism doesn’t have any methods, it just provides a way of evaluating the performance of data-dependent procedures under replication. In statistics we basically always want to use methods with good frequentist properties, but those methods can be anything at all. We might arrive at them using a Bayesian approach, a classical approach, some machine learning method, or whatever. But the justification for using THAT procedure is that it, e.g., controls type I/II error, provides adequate coverage, is admissible, minimizes (say) MSE, etc.
155 TM 09.07.15 at 2:39 pm: JQ: “But, when you reject H0 in favor of H1 you are accepting H1 as true, just as I said.”

Some practitioners may do that but it’s not standard and it’s not correct. To briefly explain why: H0 is always a specific hypothesis expressed usually as an equality. H1 is expressed as an inequality and is really a composite of many different hypotheses. It is not in itself a testable hypothesis. Only when made specific does it become a testable hypothesis (i.e. can be used as H0 in a new study).

H0 is supposed to express current knowledge. Rejection of H0 indicates that current knowledge needs to be refined or revised but it doesn’t in itself say what to replace it with. The scientific process doesn’t stop when some H0 has been rejected, it only then starts getting interesting. This is why I said way above that the lesson from the psychology crisis is that real science requires many replications, not just one (let alone zero). Most of the criticism of HT expressed above is made as if science consisted of single isolated studies. But no scientific question has ever been settled with a single experiment. We fundamentally know that no matter what methodology we use, there will always be errors. There is never certainty. There will always be false positives and false negatives, however we call them. To me, this is the single most important fact that students of statistics need to learn before they can be let loose on the world, armed with sophisticated tools that most of them will never truly understand.
156 TM 09.07.15 at 2:57 pm: mbw 144: I think many of us are skeptical of allowing subjective priors. You say that priors are not related to value judgments, I’m not so sure. How can you not be tempted to choose a prior that will give you the preferred result?

“My point is that because Bayesian analysis doesn’t insist on compressing the result prematurely” – we have already agreed that compressing a study to a one-bit HT is poor practice. Nothing prevents us from handing the decision-maker a CI. They can then apply their cost-benefit parameters to the CI and use that to make the decision. We can even calculate several CIs – 90%, 95%, 99%. The decision-maker must understand statistics well enough to be able to interpret these results.

One concern that I would have about the Bayesian alternative: how hard is it to explain your results to a decision-maker?
157 Dogen 09.07.15 at 4:17 pm: Here is a very recent Andrew Gelman article on the article and issues being discussed on this thread.

The quote that caught my eye in regard to the interesting semi-controversy in this thread:

“A close reading of Barrett’s article reveals the centrality of the condition that studies be “well designed and executed,” and lots of work by statisticians and psychology researchers in recent years (Simonsohn, Button, Nosek, Wagenmakers, etc etc) has made it clear that current practice, centered on publication thresholds (whether it be p-value or Bayes factor or whatever), won’t do so well at filtering out the poorly designed and executed studies.”

And here is the whole post, which is quite interesting in its own right and well worth reading, I think:

http://andrewgelman.com/2015/09/02/to-understand-the-replication-crisis-imagine-a-world-in-which-everything-was-published/
158 mbw 09.07.15 at 7:03 pm: @Ram #154. The first part of this sounds correct. But it’s very different from what you wrote the previous time. Yes, if the specific H1 is correct then with these cutoffs it will give “positive” results 80% of the time. But before you wrote that p<0.05 results will replicate as positives 80% of the time, which isn't even close to being the same thing.
BTW, how likely is H1 to be right? Generally, as a set of measure zero picked not by any prior belief about probability but by some utility criterion of importance, the probability that H1 is true is zero. This detracts somewhat from the importance of calculating the rate of nominal positives if it were true.
Then to calculate the total replication rate on the assumption that either H0 or H1 must be true (an event of probability zero) is odd.
159 mbw 09.07.15 at 7:08 pm: @TM #156 We’re pretty close to agreement. I think many decision makers can handle a set of probabilities for different outcomes, a little more transparent form than a collection of CI’s. Of course you can convert cdf’s to pdf’s, but the pdf form seems more intuitive for most purposes.
160 Ram 09.07.15 at 8:41 pm: mbw @ 158,

I apologize if my initial comment was imprecise. This is probably not the ideal forum to reevaluate the foundations of hypothesis testing theory. Suffice it to say, the frequentist interpretation of probability, frequentist decision theory, and its applications to point estimation, interval estimation, hypothesis testing, and numerous other inference problems, are philosophically coherent and mathematically exact. Online criticism of p-values and the like usually exhibits no awareness of this, suggesting that the entire framework is fundamentally flawed. It is not. Any data-dependent procedure that has good frequentist properties, which are well-defined properties that can (in principle) be empirically demonstrated through repeated experiment or simulation, is a frequentist procedure. Which frequentist properties we want a procedure to have depends on the problem we are trying to solve. Translating every study into a hypothesis test in order to generate a p-value is a problem with current publication practices, but it is not a problem with hypothesis testing. Hypothesis testing, when performed correctly in appropriate applications, delivers exactly what is promised: a conclusion which will be correct the vast majority of the time. As do other procedures.

Any method can be used to generate procedures with good frequentist properties. It turns out that Bayesian methods, suitably applied, do this particularly often, which makes them useful for solving many different kinds of problems. Still, if one produces a 95% posterior interval that has 2% coverage, I think one has produced something that is pretty useless. It is not hard to generate examples like this. Good statistical procedures have good statistical properties, and good statistical properties are good frequentist properties. Good Bayesian properties often but not always generate good frequentist properties, which makes them useful, but good Bayesian properties alone are not enough.
161 mbw 09.07.15 at 9:20 pm: @Ram #160. I think I agree with all that.
I’m impressed, however, with how consistently the standard presentation of stats leads to deep confusion on the part of most ordinary practitioners. I can see in my wife’s teaching that the outcome can be dramatically better if frequentist stats are taught in a very different way. (The course is descended from Freedman, but with some of his neurotic kinks removed.) I don’t know whether the outcome would be better if the approach were Bayesian from the start. The main reason to suspect that it might be is that since frequentist stats answers a very different, but similar sounding, question to the one of interest, the tendency toward confusion seems built in.
162 Ram 09.07.15 at 10:59 pm: mbw @ 161

I took AP statistics in high school, which was taught from a classical point of view. It made no sense, appearing to be a hodge podge of unrelated procedures for tackling unrelated problems, leading me to pursue other subjects. It was only when I learned of the Bayesian perspective (in a philosophy of science class!) that I became interested in statistics once more. The logic of Bayes is remarkably elegant: give me a probability model, and a loss function, and I will give you a decision optimally informed by the data. The problem is that Bayes itself gives no guidance about where the probability model should come from. This is often stated in terms of a problem with the prior, but the likelihood has the same problem–a Bayesian likelihood does not mean the same thing as a frequentist likelihood, even if they look the same and follow the same mathematical rules. P(y | theta) cannot be checked against the data, but P(y; theta) can be.

Frequentism’s vice is that, while it gives us a conceptual framework for evaluating data-dependent procedures, it does not tell us how to construct good procedures. This is why classical statistics seems like a hodge podge. In my view, frequentism and Bayesianism are two great tastes that taste great together. Give me a problem, and I can formulate a reasonable probability model capturing it. Using Bayes, I can update it with the data and arrive at a decision. And using frequentism, I can run a simulation to verify that my decision has the right sorts of properties. Bayesian means, frequentist ends.
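
The "Bayesian means, frequentist ends" workflow in 162 can be sketched as a small simulation. This is only an illustration under stated assumptions: a normal mean with known unit variance, a conjugate normal prior, and a fixed true parameter. The function name, sample size, and prior settings are hypothetical choices, not anything specified in the thread.

```python
import math
import random


def coverage(theta, prior_mean, prior_sd, n=25, reps=20000, seed=0):
    """Frequentist coverage of a 95% posterior interval for a normal mean.

    Assumes data ~ Normal(theta, 1) with known variance and a conjugate
    Normal(prior_mean, prior_sd**2) prior. The sample mean is drawn
    directly from its sampling distribution Normal(theta, 1/n).
    """
    rng = random.Random(seed)
    prior_prec = 1.0 / prior_sd ** 2
    post_prec = prior_prec + n               # precisions add in the conjugate model
    post_sd = math.sqrt(1.0 / post_prec)
    hits = 0
    for _ in range(reps):
        ybar = rng.gauss(theta, 1.0 / math.sqrt(n))
        post_mean = (prior_prec * prior_mean + n * ybar) / post_prec
        if abs(post_mean - theta) <= 1.96 * post_sd:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print("vague prior:      ", coverage(theta=1.0, prior_mean=0.0, prior_sd=100.0))
    print("tight wrong prior:", coverage(theta=1.0, prior_mean=0.0, prior_sd=0.1))
```

With the vague prior the interval covers the true value close to 95% of the time. With a tight prior centered far from the truth, coverage collapses toward zero, which is the kind of failure described in 160.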
163 mbw 09.08.15 at 12:58 am: @162 Makes sense. I’m largely thinking about the teaching, as described in your second sentence.
164 TM 09.08.15 at 3:45 am: 161: “I’m impressed, however, with how consistently the standard presentation of stats leads to deep confusion on the part of most ordinary practitioners.”

Let me offer a different perspective. The biggest problem for students of statistics is that most yearn for certainty and clear and easy answers. They just want cookbook recipes that will give them some answer. Maybe they have grown up with a kind of techno-optimism that doesn’t leave much space for ambivalence and uncertainty. They haven’t learned to appreciate science as an error-prone human endeavor. Maybe they are confused because reality is confusing.

But maybe Bayesians are less confused than frequentists. Somebody should conduct a hypothesis test ;-)
165 TM 09.08.15 at 4:01 am: 162: “I took AP statistics in high school, which was taught from a classical point of view. It made no sense, appearing to be a hodge podge of unrelated procedures for tackling unrelated problems”

Honestly, how likely is it that the teacher was even qualified to teach more than a basic intro? And anyway there’s no reason why high school statistics should be anything more than a very basic intro, and there’s no reason why such an intro couldn’t be well done (by which I mean that the focus is not on procedures and recipes but on the basic ideas and their interconnections). There’s no reason to force “a hodge podge of unrelated procedures for tackling unrelated problems” on high school students (or really on anybody) and I don’t know why that should be blamed on a “classical point of view”.
166 Matt 09.08.15 at 6:16 am: I took AP statistics in high school, which was taught from a classical point of view. It made no sense, appearing to be a hodge podge of unrelated procedures for tackling unrelated problems.

Barring one exceptional class with a great teacher that I had in the sixth grade, this description eerily resembles all the mathematics instruction I had from first to twelfth grades, including a couple of AP classes.

I was expected to learn procedures for problems, apply those procedures and only those procedures to arrive at an answer, and show my work to prove that I was arriving at the right answer the right way. As opposed to getting the wrong answer, or the right answer the wrong way. By high school I had finally learned not to arrive at correct answers my own way, no matter how much easier it was to remember, because I would always score lower. They had taught me mathematical pattern-matching instead of thinking.

This was in the USA. Is it better in other countries? Was this an unusually bad series of school experiences for the USA?
167 Ram 09.08.15 at 1:20 pm: TM @ 165,

It has a lot to do with classical statistics, since it really is a hodgepodge of unrelated procedures for tackling a hodgepodge of unrelated problems. The unifying feature is that these procedures have (classically) good frequentist properties–e.g., unbiasedness, efficiency, coverage, type I/II error rate control, etc. Statistics, when taught from a Bayesian point of view, seems much more continuous. Any problem can be formulated in terms of learning from data in the context of a probability model, based upon which the optimal decision is made. The problem is that different probability models do not in general lead to the same decision, even if we fix the data and the loss function, and Bayes gives no guidance there. This is why classical statisticians evaded Bayes altogether, for fear of subjectivity accusations. The contemporary synthesis, in which Bayes is recognized as a powerful device for developing procedures with good frequentist properties for arbitrary problems, offers the best of both worlds: a common means to solve diverse problems, and common ends in terms of which to evaluate them. Teaching this rather than classical statistics would make statistics appear more integrated and compelling, in my view. One would also like instructors to cover contemporary non-Bayesian approaches, such as those popular in machine learning. I don’t know what the best way is to introduce these topics in ways students can understand, I just worry that the current approach makes statistics seem much less compelling than it in fact is.
168 TM 09.08.15 at 2:30 pm: 166: That is a depressing picture. My Math experience (outside the US) is totally different. Math was the subject where you never had to do any rote learning because everything could be deduced by reason (really now!).

167, I remain unconvinced. Really, hodgepodge of unrelated procedures? A typical intro statistics class might cover the following hypothesis tests:
– Small sample HT for a population mean
– Large sample HT for population mean
– HT for population proportion
– HT for sample mean for paired observations
– Comparing two population means
– Comparing two population proportions

OMG, already a hodge-podge of six “unrelated procedures”! But of course they aren’t unrelated at all, they are all variations on a single idea. Successful statistics instruction would stress the remarkably simple, coherent theory behind these procedures. Also, as mentioned somewhere above, CI and HT should be taught as fundamentally related concepts rather than in isolation. Students who don’t understand this will probably never understand and never be able to correctly interpret hypothesis testing.
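
The "single idea" in 168 can be sketched in a few lines: each test compares an estimate with its null value in units of the standard error, and the confidence interval is built from the same two ingredients. This sketch assumes the large-sample normal approximation with critical value 1.96; a small-sample version would swap in a t critical value, and tests for proportions often use a null-based standard error, so the CI/HT match is only approximate there. The function name and the numbers are hypothetical.

```python
import math


def z_test(estimate, null_value, std_error, z_crit=1.96):
    """Two-sided large-sample test and the matching confidence interval.

    Returns (z, ci, reject). With the same standard error in both,
    `reject` is True exactly when `null_value` lies outside `ci`.
    """
    z = (estimate - null_value) / std_error
    ci = (estimate - z_crit * std_error, estimate + z_crit * std_error)
    reject = abs(z) > z_crit
    return z, ci, reject


if __name__ == "__main__":
    n = 225
    # One population mean: SE = sd / sqrt(n), with sd assumed to be 1.
    print(z_test(0.25, 0.0, 1.0 / math.sqrt(n)))
    # Difference of two means: SE = sqrt(sd1^2/n1 + sd2^2/n2).
    print(z_test(0.10, 0.0, math.sqrt(1.0 / n + 1.0 / n)))
    # Paired observations: same call, applied to the mean of the differences.
```

Only the formula for the standard error changes from one procedure to the next; the test statistic and the interval are always built the same way.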

Unfortunately, many students will never understand the unifying principles behind all these procedures, whether due to poor statistics instruction (rushing students to practical applications before they have understood the theory), or because students have been spoiled by years of failed math instruction as described by Matt, or frankly because students couldn’t care less. To me that is a tragedy. But I find it hard to believe that teaching from a Bayesian perspective will “make statistics appear more integrated and compelling”. One reason for doubting this is the experience (already mentioned a few times) that students find Bayes (and even simple conditional probability) damn hard. I can see how the best, most motivated students may benefit from more immersion in Bayesianism but I don’t see it for the mass of students who can’t or don’t want to go beyond the cookbook recipe approach to statistics.
169 mbw 09.08.15 at 3:03 pm: @ all the recent comments: (this is a preliminary response while a fuller one with links is hung up in moderation)

OK, here’s an ad. Frequentist statistics can be well-taught to rather ordinary college students. My wife took a Freedman-based course and built it up to the point where it’s chosen by over 3000 students per year. They mostly get over the addiction to formulas and get hooked on reason. It’s now taught very successfully by a new teacher, so it’s not just a personal anomaly.

People can look over some materials by googling stat 100 uiuc.
It doesn’t look fancy. Baby stuff, right? Students coming out of this course typically can spot the big confounder in an observational study claiming that school drug testing has no effect on drug use, etc. Compare with typical formula-driven students.

Theft of the materials is encouraged.
170 mbw 09.08.15 at 3:06 pm: @ all the recent comments: (this is a preliminary response while a fuller one with links is hung up in moderation, as is another one that mentioned an apparently dangerous word)

Frequentist statistics can be well-taught to rather ordinary college students. My wife took a Freedman-based course and built it up to the point where it’s chosen by over 3000 students per year. They mostly get over the addiction to formulas and get hooked on reason. It’s now taught very successfully by a new teacher, so it’s not just a personal anomaly.

People can look over some materials with a bit of googling.
It doesn’t look fancy. Baby stuff, right? Students coming out of this course typically can spot the big confounder in an observational study claiming that school drug testing has no effect on drug use, etc. Compare with typical formula-driven students.

Theft of the materials is encouraged.
171 mbw 09.08.15 at 3:55 pm: I will weigh in on this teaching issue, based on large-N empirical evidence, once the gatekeepers ok it.
172 mbw 09.08.15 at 4:09 pm: Ok, that little note got through. Here’s some tips from what I’ve seen work.
Use box models.
Discuss confounders and casuation.
Don’t allow formula sheets.
Use lots of data from anonymous surveys of the class.
Make a lot of the survey questions about sex.
Don’t dwell on picky points in the first semester.
Get great undergrads to do videos on the hard points.
Make randomized automated homework so students can’t pass answers.
Most importantly, fill the exams with questions that can’t be answered without getting the real concepts.
Institute intense anti-cheating techniques on exams.
etc.
173 mbw 09.08.15 at 4:10 pm: whoops: “causation”
174 mbw 09.08.15 at 4:23 pm: But that last note is not the one that has useful url’s in it, because anything like that seems to be hung up.

Comments on this entry are closed.
