There’s been a lot of commentary on a recent study by the Reproducibility Project that attempted to replicate 100 published studies in psychology, all of which found statistically significant effects of some kind. The results were pretty dismal. Only about one-third of the replications observed a statistically significant effect, and the average effect size was about half that originally reported.
Unfortunately, most of the discussion of this study I’ve seen, notably in the New York Times, has missed the key point, namely the problem of publication bias. The big problem is that, under standard 20th century procedures, research reports will only be published if the effect observed is “statistically significant”, which, broadly speaking, means that the average value of the observed effect is more than twice as large as its estimated standard error. According to standard classical hypothesis testing theory, the probability that such an effect will be observed by chance, when in reality there is no effect, is less than 5 per cent.
There are two problems here, traditionally called Type I and Type II error. Classical hypothesis testing focuses on limiting Type I error, the possibility of finding an effect when none exists in reality, to 5 per cent. Unfortunately, when you do lots of tests, you get 5 per cent of a large number. If all the original studies were Type I errors, we’d expect only about 5 per cent of them to survive replication.
In fact, the outcome observed in the Reproducibility Project is entirely consistent with the possibility that all the failed replications are subject to Type II error, that is, failure to demonstrate an effect that is present in reality.
I’m going to illustrate this with a numerical example[^1].
Suppose each of the 100 studies was looking at a treatment (any kind of intervention or change) which shifts some variable of interest by 0.1 standard deviations (in the context of IQ test scores, for example, this would be a shift of 1.5 IQ points). Suppose the population parameters in the absence of treatment are known, and we have a sample of 225 treated observations. We’d expect the sample mean obtained in this way to be, on average, 0.1 standard deviations higher than the value for the population at large. But the sample mean is itself a random variable, with a standard deviation equal to the population standard deviation divided by sqrt(225) = 15. That is, if we normalize the population distribution to have mean zero and standard deviation 1, the sample mean will have mean 0.1 and standard deviation 0.067. That in turn means that about 30 per cent of the observed sample means will be greater than twice their standard error, which is roughly the level required to find statistical significance.
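The “30 per cent” figure can be checked directly. Here is a minimal sketch in Python (standard library only), assuming the same hypothetical setup as above: a true effect of 0.1 population standard deviations and a sample of 225:

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

true_effect = 0.1          # true shift, in population standard deviations
n = 225                    # sample size
se = 1.0 / math.sqrt(n)    # standard error of the sample mean: 1/15 ≈ 0.067
threshold = 2.0 * se       # cutoff: sample mean more than twice its standard error

# Power: probability the sample mean clears the cutoff when the true effect is 0.1
power = 1.0 - normal_cdf((threshold - true_effect) / se)
print(round(power, 3))     # ≈ 0.309
```

The standardized distance from the true effect to the cutoff is exactly 0.5 standard errors, so the power is 1 − Φ(0.5), just under 31 per cent.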
Under best-practice 20th century procedure, the experimenters would report the effect if it passes the standard test for statistical significance, and dump the experiment otherwise[^2]. The resulting population of reported results will have an average effect size of around 0.2 population standard deviations[^3].
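The truncated mean that footnote [^3] asks for follows from the standard inverse Mills ratio formula for a normal distribution truncated below at the significance cutoff. A sketch, using the same assumed numbers as above; it comes out just under 0.18, close to the eyeballed 0.2:

```python
import math

def normal_pdf(x: float) -> float:
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_cdf(x: float) -> float:
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu, se = 0.1, 1.0 / 15.0   # mean and standard error of the sample mean
cutoff = 2.0 * se          # significance threshold
a = (cutoff - mu) / se     # standardized cutoff (= 0.5 here)

# Mean of the sampling distribution truncated below at the cutoff:
# mu + se * phi(a) / (1 - Phi(a)), the inverse Mills ratio formula
reported = mu + se * normal_pdf(a) / (1.0 - normal_cdf(a))
print(round(reported, 3))  # ≈ 0.176
```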
Now think about what happens when a study like this is replicated. There’s only a 30 per cent chance that the original finding of statistical significance will be repeated. Moreover, the average effect size will be close to the true effect size, which is half the reported effect size.
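The whole selection-then-replication story can also be checked by Monte Carlo. A sketch, again assuming a true effect of 0.1 and samples of 225; each study is summarized by its sample mean, drawn directly from the sampling distribution:

```python
import math
import random

random.seed(1)
true_effect, n = 0.1, 225
se = 1.0 / math.sqrt(n)
cutoff = 2.0 * se

def run_study() -> float:
    """One study, summarized by a sample mean drawn from N(true_effect, se)."""
    return random.gauss(true_effect, se)

# Publication filter: only significant original results get reported
originals = [run_study() for _ in range(100_000)]
published = [m for m in originals if m > cutoff]

# Replicate each published study once, independently
replications = [run_study() for _ in published]
share_replicated = sum(r > cutoff for r in replications) / len(published)

print(round(sum(published) / len(published), 2))  # average reported effect ≈ 0.18
print(round(share_replicated, 2))                 # share still significant ≈ 0.31
```

The simulation reproduces both halves of the argument: the published effects average nearly twice the true effect, and only about 30 per cent of replications recover significance, even though every effect is real.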
I don’t think that all of the failed replications can be explained this way. At a rough guess, half of the observed failures were probably Type I errors in the original study, and half were Type II errors in the replication.
The broader problem is that the classical approach to hypothesis testing doesn’t have any real theoretical foundations: that is, there is no question to which the proposal “reject H0 in favour of H1 if the observed effect would arise by chance only 5 per cent of the time under H0, retain H0 otherwise” represents a generally sensible answer. But we are stuck with it as a social convention, and we need to make it work better.
Replication is one way to improve things. Another, designed to prevent the kind of tweaking pejoratively referred to as ‘data mining’ or ‘data dredging’, is to require researchers to register the statistical model they plan to use before collecting the data. Finally, the dominant response in practice has been to disregard the “95 per cent” number associated with classical hypothesis testing theory and to treat research findings as a kind of Bayesian update on our beliefs about the issue in question. If we have no prior beliefs one way or the other, a rough estimate is that a finding reported with “95 per cent” confidence is about 50 per cent likely to be right. Turning this around, and adding a little more scepticism, we get the provocative conclusion of Ioannidis: “most published research findings are false”.
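The Bayesian-update point can be illustrated with one line of arithmetic. The prior, power, and significance level below are purely illustrative assumptions, not estimates from the study: with a prior of 0.1 that the effect being studied is real, and power of 0.3 as in the example above, a “95 per cent” finding is only about 40 per cent likely to be right, very much in the spirit of Ioannidis:

```python
# Illustrative (assumed) numbers: prior probability the effect is real,
# power of the test, and the 5 per cent significance level
prior, power, alpha = 0.1, 0.3, 0.05

# Bayes' rule: P(effect real | significant result)
posterior = prior * power / (prior * power + (1.0 - prior) * alpha)
print(round(posterior, 2))  # 0.4
```

Raising the prior or the power pushes the posterior up; the point is simply that “95 per cent confidence” in the classical sense does not translate into 95 per cent belief.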
[^1]: Which will probably include an error, since I’m prone to them, but a fixable one, since the underlying argument is valid.
[^2]: In reality, a more common response, especially with nearly-significant results, is to tweak the test until it is passed.
[^3]: I eyeballed this because I was too lazy to look up or calculate the truncated mean for the normal distribution, so I’d appreciate it if a commenter would do my work for me.