“Evidence based policy” and the replicability crisis

by Daniel on March 23, 2016

I have a new piece up on The Long And Short, suggesting that the “Evidence Based Policy Making” movement ought to be really very worried about the reproducibility crisis in the psychological and social sciences. In summary, the issue is that most of the problems afflicting those sciences are highly likely to be present in policy research too, meaning that the evidence base for education reform, development economics, welfare and many other policy areas is equally likely to be packed with fragile and non-replicable results. I do suggest a solution for this problem (or rather, I endorse Andrew Gelman’s solution), but point out that it is likely to be expensive and time-consuming, and to mean that evidence-based approaches will be a lot slower and deliver a lot less in the way of whizzy new policy ideas than people might have hoped.

I got quite a bit of pushback. Responding here to a few points made:

“It’s surely better than nothing”. The idea here is that the fragile and non-reproducible evidence base we are likely to have at the moment in a number of policy areas is still good enough to make decisions with. I don’t see how anyone can say this with any degree of confidence at all. The point about non-reproducible evidence is that it doesn’t constitute a valid test of the underlying true model.

“Should we go back to anecdote and political prejudice then?”. I think this objection is also ill-formed, something which is easiest to see if you think about it in Bayesian terms. The weight that you can put on fragile evidence is low, because you know that its statistical significance[1] has been overstated by an unknown amount. In which case, you would only regard the evidence as shifting your view if your original prior was very weak. So perhaps in entirely new policy areas it might make sense to create policy on the basis of a compromised evidence base. When you have a status quo which seems to be broadly working (rather than in crisis, a case which I’ll come to below[2]), then it seems to me unlikely that non-reproducible evidence ought to convince you that a big improvement is possible. Note also that weak evidence might not even be directionally correct; it’s certainly possible that there is material evidence in the literature in favour of policies which might make things worse.

“We’ve got to do something”. Well, do we? And equally importantly, do we have to do something right now, rather than waiting quite a long time to get some reproducible evidence? I’ve written at length, several times, in the past, about the regrettable tendency of policymakers and their advisors to underestimate a number of things: the physical deadweight cost of reorganisation, the stress placed on any organisation by radical change, and the option value of waiting. A lot of my scepticism about evidence-based policymaking is driven by a strong belief in evidence-based not-policymaking.

Finally, I’d note that the use of a published social sciences evidence base is not at all necessarily inconsistent with making policy based on political prejudice and anecdote. In my experience, a lot of the Leading British Evidence-Based Community seem to combine a deep sense of paranoia about policymakers wanting to ignore evidence and go back to prejudice and anecdote, with an equally deep naivete about the same policymakers cherry-picking from the evidence offered, a problem which would still exist even if the evidence itself was robust. Since they are in large part advising people like Michael Gove, I think this is a bit of a blind spot.

In summary, my view then is that what we need is genuine, robust-evidence-based policy making, and (therefore) a lot less of it. What we’re likely to get is policy making based on a biased selection from an already weak evidence base, combined with a structural attempt to delegitimise any protest or critique of that policymaking as Luddite and anti-scientific. People need to be worrying about this.

[1] There’s a temptation to read about all the problems with p-values and presume that this is all a problem of frequentist statistics and that we wouldn’t have a reproducibility problem if everyone converted to Bayesianism. Unlikely, IMO. The underlying problem here is methodological, not mathematical. There are some interesting issues which are created by the arbitrary choice of 5% as a significance level, but the general issue of institutional incentives to overstate statistical significance would be there whatever framework you use.
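
A minimal simulation of that incentive problem, with entirely invented numbers: suppose the intervention being studied does nothing at all, but the analyst is free to write up whichever of ten candidate outcomes looks best. The nominal 5% false-positive rate roughly octuples, and the same selective-reporting problem arises whatever test statistic or framework you plug in.

```python
# Sketch only: invented study sizes, no real data. Shows how "report the best-looking
# of several analyses" inflates apparent significance even when nothing is there.
import numpy as np

rng = np.random.default_rng(0)
n_studies, n_per_group, n_candidate_outcomes = 5000, 50, 10
false_positives = 0

for _ in range(n_studies):
    # The "treatment" has no true effect on any of the candidate outcomes.
    treated = rng.normal(size=(n_per_group, n_candidate_outcomes))
    control = rng.normal(size=(n_per_group, n_candidate_outcomes))
    diff = treated.mean(axis=0) - control.mean(axis=0)
    se = np.sqrt(treated.var(axis=0, ddof=1) / n_per_group
                 + control.var(axis=0, ddof=1) / n_per_group)
    t = diff / se
    # The analyst writes up whichever outcome looks most "significant".
    if np.abs(t).max() > 1.96:   # large-sample 5% two-sided cutoff
        false_positives += 1

print("nominal false-positive rate: 5%")
print(f"rate with pick-the-best reporting: {false_positives / n_studies:.0%}")  # roughly 40%
```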

[2] Historical note: I didn’t. I think I had a point here and meant to fill it in, but went to sleep instead and forgot it. Presumably it was to do with the fact that the perception of a policy area as being “in crisis” in the first place, and therefore requiring immediate action, is itself a political decision and subject to huge amounts of cherry-picking. The decision of what you need to gather evidence on is one which itself ought to be evidence-based; we can note that we have huge amounts of evidence-based suggestions for education policy but only John Quiggin appears to be regularly making an attempt at evidence-based intelligence policy.

{ 70 comments }

1

Ebenezer Scrooge 03.23.16 at 10:24 am

What Burke said.

2

RNB 03.23.16 at 10:32 am

Haven’t read this yet. I hear it has been the subject of class discussions.
http://www.vox.com/2016/3/14/11219446/psychology-replication-crisis

3

Ecrasez l'Infame 03.23.16 at 11:39 am

The replication crisis in psychology doesn’t have much to do with evidence based policy making, or really statistics for that matter.

The problem with the replication crisis is that much psychological research is fundamentally fucking ludicrous. Take “People make less severe moral judgements when they’ve just washed their hands”. They’re not aiming to estimate the practical effect of a credible intervention. They’re trying to support a wild psychological theory by demonstrating it has an outlandish minor implication via a weak measurement construct in an artificial environment. Of course replication is difficult, but failure in that case isn’t a problem for policy researchers who aren’t attempting anything quite so stupid.

Most evidence based policy making is fundamentally practical. Does this increase literacy? Or increase vaccine uptake? Or aid people finding employment? So you have a serious measurable outcome and a plausible intervention and you’re studying the effect. People successfully did research like this before the invention of t-tests. We don’t have to worry about its methodology just because psychologists managed to construct a bogus “willpower depletion” theory of the brain based on how long it takes people to eat marshmallows. That’s a problem peculiar to psychology.

4

Daniel 03.23.16 at 11:50 am

#3: Substantially none of your comment is correct. There is a statistical problem there, the replicability issue in psychology is not confined to results you believe to be “fucking ludicrous” and it is necessary to worry about methodology.

5

BenK 03.23.16 at 12:18 pm

There are some serious problems that lead to issues with reproducibility, replication, verification, and generalization. These problems all feed into the difficulties interpreting the scientific literature. Reproducibility issues mean that even if you only try to repeat what you did before, it might not work because your statistics were questionable. Replication means that if you read the paper very carefully and try to do what the other person did before, it doesn’t work. This is the big problem that big pharma has with academia. The methods can be tricky, specific, and so on. A bit voodoo. It’s an exercise for the reader, so to speak, to find out whether that changes how you interpret the discussion section of the paper. Sometimes it does, sometimes it doesn’t. Verification is about the computations that follow the lab work.

The elephant in the room for policy and evidence-based medicine is ‘generalizability.’
If the experiments are not robust, then … the conclusions only hold under certain conditions (perhaps unnoticed) regardless of the effect size. This is the essential difficulty with ‘WEIRD’ [Western, Educated, Industrialized, Rich, Democratic] samples and all those studies on college students. It is the essential problem with drug testing on specific inbred mice. In fact, it is an essential problem. The better controlled the experiment, the less it generalizes. Either you need to do a less controlled experiment, or you need to do more of them under diverse conditions. A ‘less controlled’ experiment risks incorporating bias unnoticed; a controlled experiment imports it purposefully.

Theoretically, controlled experiments are great – but then the conclusions are usually ‘over interpreted.’ The word ‘suggests’ is habitually inserted and subsequently ignored.

Without ‘evidence based’ what do we do? It’s not clear. But certainly, misinterpreted evidence used as a bludgeon to advance particular causes and positions is no better than no evidence at all.

6

Glen Tomkins 03.23.16 at 1:15 pm

If it’s any consolation, and I’m sure it isn’t, the self-styled “evidence-based” approach doesn’t work in medicine either.

7

Slackboy2007 03.23.16 at 1:16 pm

The problem seems to be that psychologists are assuming that their published research does not have an effect on the future behaviour of the people who read it or who have learned something about its results. They are assuming that the behaviour of a person is as constant as the behaviour of an elementary particle, when really, if a psychology paper is influential and well reported, it will have the effect of changing people’s behaviours, as they will be more savvy to the potential problem that the paper has pointed out and so will change their actions accordingly.

The more reported a paper is in the general media the more influence it should have on the society that it is trying to measure, so the question then becomes: how do you measure the amount of influence that a given paper has had on the collective behaviours of society? I don’t think that there is a simple answer to that question, nor do I think that it is a question that even needs to be answered.

This is why I favour the outlook of (Lacanian) psychoanalysis over general psychology: the latter assumes that people’s behaviours are largely constant and that society is unable to learn from scientific research that is done on its own behaviours; the former on the other hand is fully aware and honest about its interventionist nature, and in fact strives to promote it. To paraphrase Marx: the point of psychoanalysis is not to measure people’s behaviours; the point of psychoanalysis is to change them.

8

otpup 03.23.16 at 2:05 pm

This is actually a larger problem than just the social sciences (a fact that was both surprising and deeply unsettling to discover). Biomedical sciences (especially public health and nutrition) get by on unexpectedly little hard evidence (i.e., experimental data) because of the expense and/or the ethical concerns. As a result, received wisdom plays a bigger role than one might think (or wish). Examples: saturated fat dietary hypothesis, sodium and hypertension, the caloric excess theory of obesity…

9

Mike Furlan 03.23.16 at 2:29 pm

1. In an environment of declining funding for science, it isn’t surprising that grant writers will be under great pressure to tell the best story to get what money is available.

http://issues.org/30-1/the-new-normal-in-funding-university-science/

2. If your job is to sell charter schools, or a “breakthrough” drug, you will do what you need to do, or will be replaced by someone who will.

10

Patrick 03.23.16 at 2:34 pm

This would seem to be a powerful argument for not claiming that social systems are in crisis when they’re just imperfect. Which in turn means tolerating a degree of racism, sexism, colonialism, classism, etcetera, in our schools and workplaces and government.

I agree, but good luck selling that.

11

TM 03.23.16 at 3:13 pm

I haven’t read your piece but do you address the question of effect size? In many cases where effects in social science and policy have been “confirmed” statistically, the actual effect sizes were too small to be of much relevance, yet have been used as “evidence” in favor of certain policy prescriptions (the worst example is Chetty et al.’s VAM study, which is flawed in many ways, but even taken at face value the effect size is tiny). In cases where large data sets have been available to be mined (which isn’t usually the case in psychology experiments), very small effects can be statistically significant. Whether they are reproducible is still a different question, but often the immediate question should be: is the effect large enough to support specific policies?
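
A quick sketch of that point with made-up numbers: in a big enough data set, an effect far too small to matter for policy still sails past conventional significance.

```python
# Sketch only: hypothetical sample size and effect, no real data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 1_000_000                       # a large administrative data set
true_effect = 0.01                  # one hundredth of a standard deviation: negligible
treated = rng.normal(loc=true_effect, size=n)
control = rng.normal(loc=0.0, size=n)

result = stats.ttest_ind(treated, control)
print(f"p-value: {result.pvalue:.2g}")      # comfortably below 0.05
print(f"effect size: d = {true_effect}")    # far too small to hang a policy on
```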

12

James Wimberley 03.23.16 at 3:27 pm

Other fields should create their version of Cochrane: really tough meta-analyses of medical treatments. Very often the conclusion is “we don’t know”.

There are also ethical limits on the pursuit of scientific rigour. You have a treatment that probably saves lives. Increasing the certainty of this conclusion involves more double-blind trials – and deaths among the control group. This anxiety probably limits the scope of evidential security in social policy too.

13

Salem 03.23.16 at 3:59 pm

“Misinterpreted evidence used as a bludgeon to advance particular causes and positions is no better than no evidence at all.”

Really? Why?

The whole point about evidence-based policy (evidence based anything, really!) is that the conclusion follows from the evidence. And this means follows in a causative sense – if the evidence had been otherwise, the conclusion would have been different. If and only if the evidence suggests that policy A is more effective than policy B do we conclude that policy A is superior. That way there is a chain of causation from how effective policy A is, in the world, to whether we implement policy A. A good thing!

If, on the other hand, we just decide to implement policy A, and then go looking for whatever evidence supports it, there is no causative link between the effectiveness of the policy and whether it gets implemented. You may as well just implement policy A and not waste your time on “evidence” that is irrelevant to the decision.

14

scritic 03.23.16 at 4:33 pm

I do wonder about your preference for “evidence-based not-policymaking.” How, for instance, would that translate to the 2009 Obama fiscal stimulus? Conservative commentators like Jim Manzi of National Review certainly argued for something like that back then (see here, here and here) — that there was no experimental evidence that stimulus worked (contrary to what, say, more Keynesian-oriented economists like Paul Krugman and Brad DeLong were saying). Many people noted then that rooting for the status quo was an essentially conservative thing to do. And research results, however careful, will always be open to disputes.

15

tdm 03.23.16 at 4:51 pm

Failure to publish null results is part of the replicability crisis. The equivalent in blog comments is failure to publish agreement with the author. Therefore, my contribution is to say that I agree with everything Davies says here and in his article.

16

Daniel 03.23.16 at 5:56 pm

“In an environment of declining funding for science, it isn’t surprising that grant writers will be under great pressure to tell the best story to get what money is available.”

It’s that irregular verb I noticed at the time of the LIBOR scandal:

I do my best with the perverse incentives provided
You game the system
He is a crook.

“I haven’t read your piece but do you address the question of effect size?”

Not directly, except in as much as it contributes to the problem of “intrinsic fragility”, which I don’t think can be solved at all, other than by doing less policy altogether.

“Other fields should create their version of Cochrane: really tough meta-analyses of medical treatments.”

Unfortunately, the best guesses from the psychology replication project are that in a lot of cases, the literature is so contaminated that meta-analysis isn’t going to help either. Even in medical science, people are beginning to worry about Cochrane (the very logo of the Cochrane Collaboration is a forest plot for a metastudy on prenatal steroids which has been found to have been flawed with partial reporting!)

“The whole point about evidence-based policy (evidence based anything, really!) is that the conclusion follows from the evidence. And this means follows in a causative sense – if the evidence had been otherwise, the conclusion would have been different.”

This might very well be the whole point – it seems right to me – but it just underlines the severe problems with evidence-based everything as it actually exists in the world (and if there is one field of policy-making which can’t help itself to theoretical justifications as a defence against analysis of real-world performance, then this is surely it!).

“Conservative commentators like Jim Manzi of National Review certainly argued for something like that back then (see here, here and here) — that there was no experimental evidence that stimulus worked.”

I would take this as (another) case in which it was clear that the “evidence-based” – in so far as that is taken to refer to experimental evidence or treatment/response analogies – was not the right way to do policy. I can see why other people would want to say that sound macroeconomic theory and structural modelling ought to count as “evidence based”, but I’d rather just say that there can be other good reasons to do things.

17

b9n10nt 03.23.16 at 6:00 pm

Patrick @ 10:

Or skepticism toward social empiricism could be a tool for legitimizing politics and delegitimizing technocracy.

18

Salem 03.23.16 at 6:30 pm

“This might very well be the whole point – it seems right to me – but it just underlines the severe problems with evidence-based everything as it actually exists in the world (and if there is one field of policy-making which can’t help itself to theoretical justifications as a defence against analysis of real-world performance, then this is surely it!).”

Agreed. I am certainly not suggesting that Evidence Based Policy (or Medicine) lives up to its billing.

I think these problems are particularly severe where people do not have skin in the game, or when, perhaps, they have skin in another kind of game. Real empiricism requires iteration and feedback, but if the optimisation is towards “get published” or “get elected” or “get side-payments” then the outcome will be different than if it’s towards truth. To take one example, I do not believe that the last government’s policy towards those on housing benefit would have been remotely the same if they thought that they, or their close family, would be affected by it. When the truth matters to us, we become truth-seeking creatures! When it doesn’t, we prefer other prey.

19

Stephen 03.23.16 at 7:30 pm

A cynical view would be that politicians, of all denominations, do not so much want evidence-based policies as policy-based evidence.

Something about reason being the slave of the passions comes to mind here.

20

King of Hearts 03.23.16 at 8:17 pm

OP, can you explain the extent to which you believe that policy cannot be effectively formed based on evidence? It seems like you’re simply in support of a ‘govern best by governing least’ sort of society, or am I assuming too much?

I’m uncomfortable with the assertion that systematically observing the world, attempting to understand causality, and acting based on that, is not the best way to navigate existence. Maybe you’re not suggesting the contrary, but it sounds like it.

21

Brett Dunbar 03.23.16 at 9:14 pm

Interestingly, economics has much less of a problem with reproducibility. A recent study on reproducibility, as reported by The Economist:

http://www.economist.com/news/science-and-technology/21693904-microeconomists-claims-be-doing-real-science-turn-out-be-true-far

SCIENCE works for two reasons. First, its results are based on experiments: extracting Mother Nature’s secrets by asking her directly, rather than by armchair philosophising. And a culture of openness and replication means that scientists are policed by their peers. Scientific papers include sections on methods so that others can repeat the experiments and check that they reach the same conclusions.

That, at least, is the theory. In practice, checking old results is much less good for a scientist’s career than publishing exciting new ones. Without such checks, dodgy results sneak into the literature. In recent years medicine, psychology and genetics have all been put under the microscope and found wanting. One analysis of 100 psychology papers, published last year, for instance, was able to replicate only 36% of their findings. And a study conducted in 2012 by Amgen, an American pharmaceutical company, could replicate only 11% of the 53 papers it reviewed.

Now it is the turn of economics. Although that august discipline was founded in the 18th century by Adam Smith and his contemporaries, it is only over the past few decades that its practitioners (some of them, anyway) have come to the conclusions that the natural sciences reached centuries ago: that experiments might be the best way to test their theories about how the world works. A rash of results in “microeconomics”—which studies the behaviour of individuals—has suggested that Homo sapiens is not always Homo economicus, the paragon of cold-blooded rationality assumed by many formal economic models.

But as economics adopts the experimental procedures of the natural sciences, it might also suffer from their drawbacks. In a paper just published in Science, Colin Camerer of the California Institute of Technology and a group of colleagues from universities around the world decided to check. They repeated 18 laboratory experiments in economics whose results had been published in the American Economic Review and the Quarterly Journal of Economics between 2011 and 2014.

For 11 of the 18 papers (ie, 61% of them) Dr Camerer and his colleagues found a broadly similar effect to whatever the original authors had reported. That is below the 92% replication rate they would have expected had all the original studies been as statistically robust as the authors claimed—but by the standards of medicine, psychology and genetics it is still impressive.

One theory put forward by Dr Camerer and his colleagues to explain this superior hit rate is that economics may still benefit from the zeal of the newly converted. They point out that, when the field was in its infancy, experimental economists were keen that others should adopt their methods. To that end, they persuaded economics journals to devote far more space to printing information about methods, including explicit instructions and raw data sets, than sciences journals normally would.

This, the researchers reckon, may have helped establish a culture of unusual rigour and openness. Whatever the cause, it does suggest one thing. Natural scientists may have to stop sneering at their economist brethren, and recognise that the dismal science is, indeed, a science after all.

It seems that if you want to do evidence based policymaking in economics the research is fairly robust and reliable.

22

David 03.23.16 at 9:16 pm

Well, I can see at least three separate issues here. (1) Replicability of experiments as such (2) use of data generated by such experiments for policy-making and (3) policy-making which tries to be as rational as possible. Proper policy-making should be based, where possible, on the observation of large populations over long periods of time. The teaching of reading is a classic example, because experience in many countries has shown that some methods work, on the whole, better than others. So the “evidence ” suggests that if you use this method rather than that method, you will, on average and allowing for other factors, get a better result. That’s what evidence-based policy-making should really consist of.
Obviously it isn’t always like this, and governments can and do misuse data, statistics and studies to support the conclusion they wanted to reach anyway. This is because in politics governments like to present themselves as behaving according to some kind of rational criteria, and feel uncomfortable if they can’t. But don’t forget that not all policies are intended to be evidence based: many are based on normative criteria which are not influenceable by evidence. Governments feel that something is “the right thing to do” or is in tune with their ideology, and it’s on those grounds that they defend it. A recent example is the French government’s decision to put an effective end to the teaching of Greek and Latin in state schools except as an option for 16-18 year olds. There is no “evidence” behind this (how could there be, one way or the other?) but it has been justified on the grounds that classical languages are “elitist”. Such an assertion, whatever you think of it, is obviously not testable.

23

brad 03.23.16 at 9:25 pm

Spot on.

24

otpup 03.23.16 at 9:36 pm

Daniel @15. The problem with meta-analyses is that their results don’t mean much in the face of systemic bias (i.e., a bias that is widespread among the merged studies). The interesting thing about meta-analysis results is when they contradict “accepted” opinion, because you would expect that the bias exists and tends to go in favor of conventional wisdom. So when a Cochrane Collab. meta-analysis contradicts an accepted fact, e.g. that saturated fat in the diet causes heart attacks, it is worth paying attention to. (As an aside, in the case of saturated fat, the results of the meta-analysis could have been easily anticipated, since the experimental data is so, so weak.)

25

James Wimberley 03.23.16 at 9:39 pm

Daniel #15: an unsourced and vague criticism of one Cochrane study is not exactly an evidence-based refutation of their approach. On partial reporting, see Ben Goldacre’s plea at a Cochrane conference for their support for his campaign that the results of all clinical trials must be published, negative or positive. There are few professional benefits to publishing negative results, and strong commercial pressures not to, so it’s an important principle.

26

Paul Reber 03.23.16 at 9:42 pm

“Evidence based policy” recommendations still work for proper definitions of “evidence.” The “replicability crisis” is a complex issue, but here’s a simple heuristic if you want to know what you can rely on for actual policy/practice recommendations:

Don’t count on anything until it has been replicated. Preferably more than once.

A single p<.05 just isn't really reliable evidence. Two, three or five papers all finding conclusions in the same direction — that's evidence.
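
A back-of-envelope version of that heuristic, applying Bayes' rule with invented values for the base rate of true hypotheses and for study power:

```python
# Sketch only: the base rate and power below are assumptions for illustration,
# not estimates for any particular literature.
prior_true = 0.10   # assumed share of tested hypotheses that are actually true
power = 0.50        # assumed chance a true effect comes up significant
alpha = 0.05        # chance a null effect comes up significant anyway

def prob_true_given_significant(k: int) -> float:
    """P(hypothesis true | k independent studies all came up significant)."""
    true_path = prior_true * power ** k
    false_path = (1 - prior_true) * alpha ** k
    return true_path / (true_path + false_path)

for k in (1, 2, 3):
    print(f"after {k} significant result(s): P(true) ~= {prob_true_given_significant(k):.2f}")
# 1 result:  ~0.53  ("cool, if true")
# 2 results: ~0.92
# 3 results: ~0.99
```

The exact numbers depend entirely on the assumed base rate and power, and on the replications being genuinely independent and honestly reported; the point is only the shape of the curve.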

Full disclosure: I'm a professor of Psychology at a Tier 1 research university who theoretically makes his living doing the kind of work people are worrying about. I also teach research methods (and ethics) and have for nearly 2 decades. I'm personally not so sure that what is currently going on merits the term "crisis" but I also have no doubt that some percentage of published social science studies aren't correct (won't replicate).

There's really no magic bullet or simple procedural improvement that will protect us from error when people are working at the cutting edge of theory. Too much can go wrong when advancing the frontiers of knowledge.

So the first time you see something weirdly interesting, you should always think "cool, if true" no matter what the reported p-value is. And then wait to see it again. Policy should always be based on a volume of evidence accumulated across several studies and ideally multiple labs.

Really well done Randomized Clinical Trials (RCT) are potentially an exception to this as they are supposed to be done with preregistered hypotheses, careful power analysis and well-followed study procedures. Sadly, many RCTs aren't executed flawlessly but this has more to do with perverse incentives in some medical research.

27

otpup 03.23.16 at 9:43 pm

Correction: The Cochrane meta-analysis on sat. fat could have been predicted, but the broader point is that any observational study, even if done well, may well be contradicted by experimental evidence. I know of one stats prof who will not let you pass his grad course unless you can cite a specific example of an observational study that was later contradicted by an experiment.

28

Hidari 03.23.16 at 10:02 pm

Never trust the precis of something in the Economist, especially not when it concerns their beloved ‘science’ of economics.

http://www.sciencemag.org/news/2016/03/about-40-economics-experiments-fail-replication-survey

29

A H 03.23.16 at 10:12 pm

@20 The problem with experimental economics is generalizability, as BenK pointed out up-thread. There is little reason to trust that those results will translate into effects in the real world.

Evidence based economics is really very hard, because to learn anything interesting you need to do time series analysis. But then you have the dual problems of bad data (especially at the macro level) and time series statistical techniques being weak and difficult. For these reasons, I’m very skeptical that the “empirical” turn in modern econ will result in much. People like Noah Smith push a kind of statistical positivism, where truth is defined as a reduced form regression. Generalizability and various time series problems are going to come back to bite them in the next decade.

30

Faustusnotes 03.23.16 at 10:17 pm

Cochrane meta analyses include an assessment of the quality of the evidence, and typically leave out observational studies or studies with poor quality designs, though it depends on the interventions.

I am amused at the idea above that economics journals encourage publication of more extensive methods. Most economics articles I read don’t even have a methods section.

I’m not convinced that “evidence-based policy” is that related to statistical evidence. Sometimes it just means implementing policies that seem to have worked in other countries in the sense that they were implemented there and the problem got better or didn’t get worse. The kind of policy that depends on a couple of scientific studies seems very narrow and minor.

The big problem I see with the replicability crisis is in its effect on institutional decision making by key bodies like NICE, the CDC, etc. NICE has already been conned into depending on the mathematically, statistically and theoretically meaningless use of incremental cost effectiveness ratios (ICERs) for example, and drug regulatory bodies heavily depend on classical experimental design. This stuff affects policy piecemeal (don’t use intervention A, don’t fund drug B) but it builds up into a landscape of small policy changes that are threaded through all of the health system.

I see no evidence, on the other hand, that big policy schemes (e.g. whether to decriminalize prostitution or drugs, whether to support needle syringe programs, deinstitutionalization of the mentally ill) are related to these kinds of specific scientific evidence, rather than to a broader body of theory and common sense (see e.g. Australia’s response to HIV).

31

Ronan(rf) 03.24.16 at 12:06 am

I’d say I don’t have any particularly high hopes for “evidence based” policy making, but then I also don’t have many problems with it. It seems to me policy is made by some combination of interests, politics, ideology and policy maker preferences; if you could skew that (even slightly) by institutionalising some evidence-based culture within policy-making bureaucracies, then that seems a noble enough aspiration to me. That politicians and policy makers (and the rest of us) will choose the evidence to fit our priors doesn’t seem to me to negate the goal. Or at least it just implies it’s all very complicated, and an attempted shift to “evidence based” policy making, at worst, just gives people of ill repute better arguments for bad policy.

32

Sebastian H 03.24.16 at 12:13 am

The replicability crisis has a number of prongs. From a layman’s perspective you should be aware of the following.

Be wary of studies that on inspection looked at a lot of variables but found that only one or two of them were ‘significant’. You can’t trust those studies until someone has investigated the phenomenon again. A significance threshold of p < .05 means that if you study 20 things that aren’t actually related to what you are studying, one of them is nevertheless likely to show up as ‘significant’ under the test. (It is worse than that, but that will do.)
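
The arithmetic behind that claim, as a one-liner (assuming 20 genuinely unrelated variables, each tested independently at the .05 level):

```python
# Chance that at least one of 20 null variables clears the .05 bar by luck alone.
p_spurious_hit = 1 - 0.95 ** 20
print(f"{p_spurious_hit:.2f}")   # about 0.64
```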

“Statistically significant” does not mean “important”. Lots of things which allegedly make it past significance don’t actually make meaningful changes. A bunch of the priming experiments are like that, and a huge number of the cancer scares are based on this confusion.

From a policy perspective a lot of things get blown up on tiny little things that are at very best statistically significant but unimportant.

33

Christopher London 03.24.16 at 1:26 am

Making replication the central conceit of the whole debate is, it seems to me, a complete waste of time. Human social processes are too heterogeneous and non-random for experiments to replicate anything above the trivial. And certainly when it comes to policy, the idea of replication is a sop to technocrats who think too highly of themselves. The frequent failures of social policy around the world are not due to a lack of experiments but to the very idea that “policy makers” (whoever or whatever they are) can in any meaningful sense have sufficient knowledge of intensely local and interactive worlds to intervene effectively.

The whole RCT push is simply another variation of the circumvention of democracy and the centralization of decision making in a state apparatus (whether govt, Gates Fdn or what have you) that is anything but a machine to serve the interests of the polity. We certainly need to deploy solid research methods (whether quantitative or qualitative; personally I favor a mix of both), but the ends of that research, if we really want to facilitate self-determination and fulfillment of basic needs, need to be deployed in and through the process of intervention or “implementation”, rather than standing outside and presuming to direct.

The goal shouldn’t be replication after all (we’re not machines) but power to decide for ourselves in concert with others. As others here have said, it is woefully naive to ever think that decisions are anything but political, with ‘evidence’ being just one factor among many. Better to simply embrace that and struggle to put the process of knowledge creation and decision making into democratic processes.

34

Icastico 03.24.16 at 2:02 am

Evidence and decision making have a complex relationship, but certainly we can look to many fields where evidence drives reasonable policy decisions. Off the top of my head I can think of food safety protocols, universal precaution policies in healthcare, vaccination policies in schools, and DUI blood alcohol criteria. My take on the replication crisis is that the only crisis is the lack of demand/incentive for replication studies. I have recently published one and am in the midst of writing a grant for another. But when many grants include “innovation” as a required element to gain funding, doing this kind of work is often an uphill battle. It shouldn’t be.

35

Peter T 03.24.16 at 3:14 am

Policy is often about finding the best mix of things, in the right sequence, changed as the situation evolves. Statistically-based studies are only very weakly illuminating here. A good example is anti-smoking campaigns. Advertising the dangers is not very effective at all, but it does lay the mental ground for tax rises and workplace curbs which, when accepted, allow further measures, which lead to the widespread perception that smoking is something government should do something about, which leads to restricted access for minors, which couples with enforcement to constrict the pool of younger smokers, which then allows further tax rises and further curbs – in an extended campaign which has typically seen smoking fall from 70 per cent of adults to under 20 per cent. Which bit was most effective? No one bit in particular, but all working together, timed appropriately.

36

mclaren 03.24.16 at 5:46 am

Excellent point and particularly worrisome when companies like amazon or google start throwing Big Data and Bayesian statistics at problems and assuming the results will prove either meaningful or replicable.

There’s a neo-Platonic number mysticism at work here that presumes you can reduce anything to numbers. Once you get the numbers, you run ’em through sufficiently complex mathematical conjurations, and bingo! Now you’ve got some vector fitted to various points in multidimensional space. The vector is supposed to “optimize” something.

Three big problems with this approach. First, no evidence suggests that everything can be meaningfully reduced to numbers. It’s the issue that computer music composition or dating sites find themselves faced with: what’s the algorithm that reliably determines how good a piece of music is? What kind of mathematical procedure do you use to ensure that people will meet and fall in love? These questions are ridiculous because of the immense path-dependency and situation-dependence of such efforts. As a counterexample to all the dating site math, consider psychologist Arthur Aron’s experiments showing that you can get pretty much anyone to fall in love with pretty much anyone else just by running them through a simple procedure involving 36 questions and staring into one another’s eyes. These results suggest that people “fall in love” when they decide to, for reasons of a woman’s ticking biological clock, graduation from college and getting a secure job, etc., etc. All the algorithms in the world prove superfluous if the people involved are really serious about wanting to settle down in a long-term relationship. For the “beauty algorithm” for music or art, all we need do is listen to music from a radically different culture, like Balinese gamelan music. Safe to say that any equation that reliably tells us Beethoven’s symphonies are great music probably won’t also tell us that Balinese gamelan music is worth listening to.

The second problem involves the pitfalls of big data. Random matrix theory tells us that the number of false correlations rises rapidly as your data sets bulk up. Nassim Taleb dealt with this in his article “Beware the Big Errors of ‘Big Data’,” Wired magazine, February 2013. As data sets explode in size due to ginormous hard drives and vast server clusters, this problem is getting critical. As a real-world example, just recently a NY Times article cited whistleblower William Binney pointing out that the NSA is so swamped with data it can no longer do its job. The data sets have grown too big to extract meaningful correlations from without getting bogus artifacts like the ones ridiculed at the site Spurious Correlations.
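
A small simulation of that point, with arbitrary sizes: generate a couple of hundred mutually independent random series and count how many pairs look impressively correlated anyway.

```python
# Sketch only: arbitrary sizes and threshold, purely synthetic data.
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_vars = 100, 200                      # 200 mutually independent random series
data = rng.normal(size=(n_obs, n_vars))
corr = np.corrcoef(data, rowvar=False)

# Count variable pairs whose sample correlation looks "strong" despite no real link.
iu = np.triu_indices(n_vars, k=1)
spurious = np.sum(np.abs(corr[iu]) > 0.25)    # |r| > 0.25 with n = 100 has p < 0.02
print(f"pairs examined: {len(iu[0])}, 'strong' spurious correlations: {spurious}")
```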

The third problem involves the phony effort to scientize fields which are inherently fuzzy and squishy and people-centric. Foolish efforts like google’s attempt to use data-driven business management techniques run aground on a real world where employees know the metrics that are being used to measure them, and push back accordingly. The result? Akin to what happened when Secretary of Defense McNamara used Operations Research in the Vietnam war. On the surface, this approach seems to make sense…after all, OR got used productively in WW II to schedule convoys, measure torpedo performance, and set up antisubmarine warfare protocols. The problem is that when OR got applied in the Vietnam war, the numbers no longer came out of neat and tidy traditional warfare situations where the number of mines or submarines could be measured and the number of sunk ships in a convoy could be unambiguously defined. In the Vietnam war, “victory” and “enemy casualties” became slippery, nebulous terms — if a platoon killed everyone in a Vietnamese village and counted them all as “Viet Cong casualties,” it turned the numbers into garbage, but the people doing the Operations Research math didn’t realize it.

The big push for data-driven policies today from private companies like google and in government agencies like Health & Human Services is shaping up to become a colossal disaster. It all puts me in mind of the fvckup that befell google when the company started to predict incidence of the flu from its search results.

Source: What we can learn from the epic failure of Google Flu Trends, Wired magazine, October 2015.

This kind of result occurs all the time when neo-Platonic number mysticism leads people to inadvisedly mathematize fundamentally non-numeric qualia like “customer satisfaction” or “romantic love” or “business efficiency.” It works OK for a couple of years, and then the math breaks down (very much the same way this kind of pseudoscientific mathematization tends to work all right for a short while in the stock market and then falls apart. Can anyone say “Long Term Capital” and the Black-Merton-Scholes option pricing equation, 1997?). You wind up with complex algorithms that tell you ENRON is a wonderfully efficient business but don’t tell you it’s because the company is a giant Ponzi scheme. You wind up with hideous hellmouth companies like Comcast feeling great because of their customer retention numbers, but the math doesn’t explain that the customer retention occurs because Comcast is a monopoly and the consumer has no choice if they want to get cable TV.

Most of all, you wind up with debacles like Bill Clinton’s 1996 “welfare reform” that destroyed poor people’s lives while producing shining happy numbers showing that everything was going swimmingly. We’re seeing the same thing from economists right now with all those numerics that “prove” the American economy is doing great, while people with two graduate degrees from prestigious Ivy league colleges wind up taking minimum-wage part-time jobs because the economy is in reality in the toilet and getting worse for everyone except the corporate CEOs and their billionaire stockholders.

37

Hidari 03.24.16 at 7:59 am

Yes, but ‘neo-Platonic number mysticism’ (neo-Pythagoreanism, really) is the secular religion of our age.

Daniel didn’t make the point explicitly, but one of the key ironies here is that number mysticism (and its cousin ‘scientism’*, the idea that only the methodologies used by the ‘hard’ sciences can lead to Truth with a capital ‘T’) is particularly common amongst New Atheists and self-professed ‘Skeptics’, who are normally the first to proclaim that they are materialists, not-mystics etc.

This is not always true, but it is frequently true.

*In the real world, there is frequently hardly any difference between these two belief systems, as the methodologies used in the natural sciences tend to need and produce quantitative data. So qualitative data is almost by definition of less value, in this worldview.

38

Chris Bertram 03.24.16 at 8:41 am

There seem to be lots of areas of policy where the initial claim of the linked article seems to be false. So, for the UK, it seems immigration policy is very little informed by evidence and in education policy enormous policy changes (turn all schools into academies) are pursued on no evidential basis at all. Tory policy on the disposal of social housing: evidence based? Obviously, I could extend the list.

39

JoeinCO 03.24.16 at 8:41 am

Someone should write Priors and Prejudice if it doesn’t already exist. As is pointed out in the article, the word “Bayesian” does not magically make for effective policy. Who decides which studies are “fragile”? Who decides the metrics of “goodness?”

Whenever I hear the coercive words “evidence-based” (How could you possibly be against something that is evidence based?) I immediately think that the speaker has an agenda that they are not telling me, usually involving my wallet.

40

Robespierre 03.24.16 at 10:45 am

Beats the alternatives

41

Daniel 03.24.16 at 10:49 am

I think that it can be formed based on evidence, but that this isn’t cheap or easy (we should have had some sort of clue about this based on the time & effort it takes to get a drug from first compound to FDA approval, versus the time it takes to start a war or overhaul the benefits system). And not so much that one should “govern less” as that one should change the way in which one governs less; deregulation is just as much of an experiment as regulation.

So what I’m opposed to is not “systematically observing the world …” but the current practice of not really systematically observing the world, fixing the p-values so it looks like you did, and then acting on a bunch of wild-assed guesses and calling them “evidence”.

42

Soullite 03.24.16 at 10:52 am

I won’t even read this. But I guarantee you it will be the same […

[perhaps this was satirical, perhaps not. But it ended up in an unrelated rant about CT being “feminist”, which pushed the decision to delete it over the line. Soullite, please don’t comment on this thread]

43

Daniel 03.24.16 at 11:09 am

James: “an unsourced and vague criticism of one Cochrane study is not exactly an evidence-based refutation of their approach”

If you want the source, it’s note 19 to this paper, which was linked from the interview I linked to above. Since that was an interview with John Ioannidis, I think it’s a bit much to call it a “vague criticism”; it’s a comprehensive review of the way in which the problems with replicability are even present in medicine. Also…

“On partial reporting, see Ben Goldacre’s plea at a Cochrane conference for their support for his campaign that the results of all clinical trials must be published, negative or positive.”

I find Goldacre really frustrating in this area, because as you say, he knows how important it is to have full coverage and registration of methodology, but he continues to be a leading voice for “evidence based policy” in areas where it’s clear that this practice is not followed at all. I think (based on the few interactions I’ve had with him) that he holds a version of the “it’s better than nothing” view, which would be consistent with his general rather mindless attitude to social science in general and qualitative research in particular. But he certainly wouldn’t let a pharmaceutical company get away with saying that a partial database of post-hoc analysed trials was “better than nothing”.

Paul: I think it’s more serious than that. On my reading of Ioannidis’ work and Gelman’s explanations, it doesn’t seem clear to me that even ten papers with the same finding can necessarily be trusted, if they’re all using data selection and analysis methods which were decided after the data was gathered.

Chris: believe it or not, many of those education reforms are indeed “evidence based”, which is what got me looking at this area in the first place. Gove (and his advisor Dominic Cummings) got well into the field when they drew Ben Goldacre from the “policy guru of the month club” that they both seemed to have joined. IMO, the way in which the UK government has used “evidence based” rhetoric also ought to have given a lot of people a lot more pause than it did about whether it was a good idea to lend their name and reputation to a project without demanding a lot more control over it. Gove’s now looking at launching a lot of “evidence based” prison reforms, so I think the benefits of “evidence based” policy may be coming to immigration policy sooner than you think… iirc, compulsory English tuition is one of the early successes.

44

oldster 03.24.16 at 11:22 am

Isn’t there another lesson in this, about the lessons we draw from our own lives?

“I hope you learned from that experience!” our parents tell us, after we tried some childish trick that failed. “I hope you learned from that experience!” our best friend tells us after we break up with the mate they never liked.

But learning from our own experiences is a strategy even more at peril from the replicability crisis. There will probably be a small number of experiences from which we can generalize–if it was a mistake to step out of the first second-story window, it will probably be a mistake to step out of the next one.

But for the vast majority of life experiences, the best policy will be to draw no inferences at all. The experience was too particular, its details unrepeatable, and we have no way to know which ones were causally relevant and which were not. (“…so after that I decided: no more men with blond hair for me!”)

I hope you don’t learn from that experience. Better to write it off. You’ll never know why it happened; you’ll never identify the causally relevant features. Even if you could, that pattern probably won’t emerge again, or if it does you won’t recognize it in its new configuration.

45

Brett Dunbar 03.24.16 at 11:51 am

LTCM’s problem wasn’t the Black-Scholes equation. What had happened is that everyone else had adopted it, so the systemic mispricing of bonds had ceased. LTCM’s business model had been to make bets on bonds converging on the theoretically correct price as they approached maturity; that ceased to work as everyone else adopted the same model for determining what the rational price should be. LTCM began investing in other areas, and it was that which brought down the fund when Russia defaulted in 1998.

The Black-Scholes equation still works. It just isn’t a useable investment strategy as everyone uses it to spot mispricing so the mispricing doesn’t occur.

46

Daniel 03.24.16 at 12:17 pm

Actually, I should significantly qualify the comments I made about Ben Goldacre in #43. I do think he isn’t nearly careful enough in how his reputation is used by the Michael Goves of this world, but it’s his reputation and therefore, obviously his decision. His own document on randomised controlled trials in policy is quite good, although in my opinion the discussion on p30 of subgroup analyses isn’t strongly worded enough, and he regards preregistration of methodology and data selection as a nice-to-have rather than an absolute requirement. However in his paper on evidence-based policy in education, he does (p16) specifically say that the research which is published in technical journals is the sort of thing which ought to inform teachers’ methodology and recommends the American “Doing What Works Clearinghouse”. I think the two approaches are totally different from the point of view of reproducibility.

47

Mary 03.24.16 at 12:40 pm

Some historians of psychology at York University have been tracking this over time, giving a deeper historical perspective on the replication “crisis”:

https://psyborgs.github.io/projects/replication-in-psychology/

Really interesting stuff.

48

Trader Joe 03.24.16 at 1:01 pm

@45
“The Black-Scholes equation still works. It just isn’t a useable investment strategy as everyone uses it to spot mispricing so the mispricing doesn’t occur.”

That’s not quite right. It remains a highly useable investment strategy and underpins virtually all of the rapid-trading algorithms that are in use today. What’s changed is that the usefulness of this knowledge is now measured in nanoseconds, whereas in the 1990s mispricing could persist much longer. The change is the availability and sophistication of computing power, not the usefulness of the formula.
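
For readers who haven't met it, here is a minimal sketch of the Black-Scholes call price being argued about, and the sense in which it "spots mispricing": compare the model price against a quoted market price. All inputs below are invented for illustration.

```python
# Sketch only: textbook Black-Scholes European call price with made-up inputs.
from math import log, sqrt, exp
from statistics import NormalDist

def black_scholes_call(S, K, T, r, sigma):
    """S = spot, K = strike, T = years to expiry, r = risk-free rate, sigma = volatility."""
    N = NormalDist().cdf
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * N(d1) - K * exp(-r * T) * N(d2)

model = black_scholes_call(S=100, K=105, T=0.5, r=0.02, sigma=0.20)
quoted = 4.10                                   # hypothetical market quote
print(f"model {model:.2f} vs quoted {quoted:.2f} -> {'rich' if quoted > model else 'cheap'}")
```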

49

Chris Bertram 03.24.16 at 1:16 pm

Just a mini-qualification of what I wrote above. Evidence *is* used in UK immigration policy, in the form of having specific policies evaluated by a committee of economists, the MAC. What the MAC gets to do, however, is (a) to answer only the highly specific question posed to it by government (the answer is then recycled for legitimation purposes as coming from an “independent” expert body), and (b) to evaluate those policies in terms of the net financial effects on a very specific group of people only (existing British resident citizens). Considerations of rights, or of the well-being of any wider group, or of whether money is a good proxy for well-being, are excluded.

50

David Simmons 03.24.16 at 4:30 pm

I am not an MD, but I worked in IT and bioinformatics in an academic medical department for 18+ years. My personal experience with doctors leads me even more to the position that medicine is more art than science, and, as the Institute of Medicine study of a few years back noted, the replication crisis is very real for what is often termed a hard science. Seeing three different ENT specialists (in the same clinic) for sinus/allergy problems gave me three different diagnoses, from “Let’s operate!” to “You don’t have a problem.”
On a different note, neuroscience, all the rage with fMRI studies claiming to “understand” what is going on in the brain (sometimes they even say the M word, mind), strikes me as being at the level of the black and white films they showed us in grade and middle school in the late ’50s and early ’60s, where “protoplasm” was the explanation for what was going on in cells …

51

Paul Reber 03.24.16 at 6:40 pm

Thanks for replying. Of course I’m aware of Ioannidis’s and Gelman’s concerns, but I don’t think we’ve seen any “finding” with a large literature behind it collapse due to statistical issues related to power, “p-hacking” or data selection. Those kinds of things will get you 1-2 papers in an area (usually from the same lab) but don’t produce the kind of volume that should drive policy.

As a technical matter, the issue is that practices with excessive 'analytical flexibility' artificially enhance the observed effect size. This can create false positives, where the true effect size is zero but the finding appears reliable. But that's still rare and doesn't produce a volume of findings (it takes a bit of luck in addition to weak methods). More common are cases where the true effect size is small, e.g., d=0.2, but we see a handful of reports where the effect size ends up in the 0.4-0.8 range. Note that in this case it is a true effect; it's just smaller than the published literature initially suggests (this often gets sorted out in meta-analysis later on — again, a volume of findings is much more informative).
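
A sketch of that inflation effect, with invented sample sizes: simulate studies of a true d = 0.2 effect and "publish" only those clearing p < .05; the published average comes out around three times the true value.

```python
# Sketch only: invented effect size and sample sizes, synthetic data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_d, n_per_group, n_studies = 0.2, 30, 20000
published = []

for _ in range(n_studies):
    treated = rng.normal(loc=true_d, size=n_per_group)
    control = rng.normal(loc=0.0, size=n_per_group)
    result = stats.ttest_ind(treated, control)
    if result.pvalue < 0.05 and result.statistic > 0:   # only the "successes" get written up
        pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
        published.append((treated.mean() - control.mean()) / pooled_sd)

print(f"true effect: d = {true_d}")
print(f"average 'published' effect: d = {np.mean(published):.2f}")   # roughly 0.6
```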

The places where a volume of research collapses are more likely to come from more basic flaws in the experimental design that get repeated. For example, you want to measure "health outcomes" based on access to health insurance, but the only measure you have is blood glucose. That's not a very good "operational definition" of health outcome, because it misses a lot of other health issues. Evaluating a result from a study with a weak operational definition doesn't require statistical analysis; it just requires going past the headline to look at the methods and make sure the authors actually measured what they describe it as.

Which is why I teach the undergraduates in my Research Methods class how to read those methods sections in the papers — so they can themselves spot the cases where the data aren't really the same as the conclusions.

52

geo 03.24.16 at 8:00 pm

Daniel @16: “there can be other good reasons to do things”

Such as

Faustusnotes @30: “theory and common sense”

for example?

What strikes a statistically illiterate, reflexively left-wing bystander is how regularly and predictably calls for regulation of profitable activities like selling cigarettes, dumping fertilizer and other pollutants, chemical manufacture that releases particulates, marketing sugared drinks and snacks, buying political influence, and many others are fended off with often bogus and always well-funded claims of “not enough evidence.” (See, e.g., Michaels, Doubt Is Their Product, and Oreskes and Conway, Merchants of Doubt.) To what extent, if any, does this call for some sense of political responsibility on the part of researchers for the way their findings are put to use? And if this is not their problem, whose should it be?

53

William Timberman 03.24.16 at 10:24 pm

To those of us at the pointy end, many of what are being advertised as evidence-based policies look a lot more like either malevolence or madness — Arne Duncan’s education policies here in the US, for example, or the European Commission’s remedy for Greek indebtedness. Lacking any influence over policy decisions other than that which comes of voting for the greater or lesser evil, sitting on folding chairs in cinder-block church basements after work listening to some local blowhard practice his elocution, or waving signs in a crowd hemmed in by armored police ninjas, we can only scratch our heads and wonder. Evidence? Evidence of what, exactly, other than der Untergang des Abendlandes (the decline of the West)?

Is it too much to ask those who have the influence we lack to be competent, to acknowledge the public interest, to grasp the significance of their stewardship, and to understand the difference between wheat and chaff, evidence or no evidence? Apparently it is.

54

Faustusnotes 03.24.16 at 11:02 pm

Geo and William, I don’t think pointing to corrupt misuse of evidence is a good counter to Daniel’s point, which I read as being more about honest misuse of weak evidence. I take it as read that everyone here understands politicians seek evidence to justify their prior political beliefs. I take Daniel to be complaining about the value of evidence for policy given its poor replicability. But to respond to the tobacco example, we do have a good, solid evidence-based global tobacco policy, the Framework Convention on Tobacco Control, that is largely not based on p-values, and it was during the long quest to establish that smoking causes cancer that we developed the modern evidentiary rules for establishing a causative relationship in epidemiology. Those rules, incidentally, don’t mention p-values. Tobacco policy is a very good example of evidence-based policy formed without RCTs. That Big Tobacco fought it is not evidence against the value of that kind of policy making.

As an example of evidence-based policy that is much more dependent on flawed studies and the use of p-values, consider the work to establish good policy for nuclear disaster response in the wake of Fukushima (which I have been involved in). Obviously there were no RCTs: none of our experiments could be planned in any way, nor could any preparation be made for data collection or study siting. But our research is the only evidence we have to inform future responses in aging societies. Our findings of no radiation risk and high evacuation mortality have been contentious (you should have seen some of the peer reviews!), but ultimately they rested on simple, basic statistical testing. Bayesian methods wouldn’t have saved them, because assumptions about prior distributions would be hugely controversial. We couldn’t do anything about the power of the studies, because the affected populations are fixed. But I believe that through careful analytical design we showed solid, robust facts that will shape future policy even though the studies were flawed. And this is important, because the next disaster will be in China, and if they don’t base policy on our evidence people will die needlessly. I think similar things about my past research on Australia’s heroin shortage: a big event that happened once and gave us a chance to rethink the relationship between harm reduction and prohibition using evidence, with results that surprised everyone involved. The evidence from these two events is too important to discard because it wasn’t collected through carefully pre-planned trials, or didn’t use Bayesian techniques!
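As a rough illustration of the fixed-power problem, here is a back-of-the-envelope sketch with entirely hypothetical numbers (a made-up two-group mortality comparison and made-up group sizes, not figures from the Fukushima work): when the affected population is fixed, the smallest detectable difference is fixed too, and no design choice can shrink it.

```python
# Back-of-the-envelope sketch with hypothetical numbers: the minimum detectable
# mortality difference for a fixed, unchangeable sample size.
import math

def min_detectable_diff(p0, n_per_group, z_alpha=1.96, z_power=0.84):
    """Smallest absolute risk difference detectable at two-sided alpha = .05 and
    80% power, comparing two equal groups, approximating both groups' variance
    at the baseline risk p0 (standard normal approximation)."""
    se = math.sqrt(2 * p0 * (1 - p0) / n_per_group)
    return (z_alpha + z_power) * se

baseline_mortality = 0.05            # assumed 5% baseline mortality
for n in (200, 1000, 5000):          # hypothetical fixed group sizes
    d = min_detectable_diff(baseline_mortality, n)
    print(f"n = {n:>4} per group: smallest detectable risk difference ~ {d:.3f}")
# With a few hundred people per group, nothing short of roughly a doubling of
# baseline mortality is reliably detectable, however carefully the study is run.
```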

If there are similarly flawed but important bodies of evidence in psychology and the social sciences, we shouldn’t throw the evidence-based-policy baby out with the tainted bathwater.

55

William Timberman 03.24.16 at 11:43 pm

faustusnotes@55

I take your point, and Daniel’s as well, and all else being equal, I’d agree that the applicability to policy of experimental evidence that can’t be replicated deserves serious scrutiny. The problem is that all else isn’t equal. The social science research results now being viewed with increased skepticism seem less often to be a consequence of poorly-designed research, or of theorists whose reach exceeds their grasp, than of a kind of lazy in-group arrogance — being overly certain that one’s methodology is sound, and not deigning to look outside it for answers that it isn’t designed to uncover.

It’s not so much that researchers are necessarily suborned by peer pressure or by their funding sources, except in some egregious — and more obvious — cases, but that they’ve put all their eggs in one basket and, having done so, aren’t about to let anyone outside the club accuse them of being unprofessional or dispute their modus operandi.

56

geo 03.25.16 at 2:43 am

Thanks, Faustusnotes, especially for that long, informative second paragraph. I wasn’t, though, trying to counter Daniel’s point so much as tossing out a waspish, vaguely related observation about the misuse of the rhetoric of “good science” to frustrate environmental, public health, and other regulations and reforms. One example is what you refer to as “the long quest to establish smoking causes cancer.” That quest was in fact a great deal longer than it needed to be, and took far longer still to have the public-policy consequences it ought to have had, because tobacco companies convinced gullible citizens and politicians time and again that “more research was needed.” Other (though less flagrant and lethal) examples are abundant. There does seem to be a chronic tension between — perhaps oversimplifying a little — private interests brandishing the “evidence-based” banner and those advocating a more precautionary approach in the public interest.

This is not the problem Daniel was addressing, I recognize. I hope mentioning it hasn’t been a distraction from the very interesting previous discussion.

57

Hidari 03.25.16 at 9:37 am

58

Faustusnotes 03.25.16 at 10:23 am

Actually I think that’s a good example of how things like the replication crisis can be used to attack legitimate evidence-based policy making. Simon Jenkins in the Guardian (who used to be an HIV denier) is always talking up uncertainty and scientific inaccuracy when he advocates weaker risk management on influenza and dietary risks. This kind of debate is grist for the mill of do-nothing advocates like him.

59

Peter T 03.25.16 at 11:16 am

My policy rule of thumb used to be that if it took more than simple stats to show a positive effect, it was not worth pursuing. There were areas where advanced stats were useful – primarily in showing that the current approach was doing no better than random action. But selling these (to my mind really important) results was really, really hard. The bias to action is quite strong, and is coupled with the attitude that any idea is better than no idea.

60

engels 03.25.16 at 11:37 am

Moderated.

61

bruce wilder 03.26.16 at 3:11 am

Surfing about idly, I read this interesting thread, and ruminated a few seconds and moved on to Atrios of Eschaton, where I found myself staring at an ad for Transcendental Meditation, which was touted as . . . yes, scrawled into the ad: “evidence-based”.

Creepy? A bit. But, was the ad placement evidence-based? Hmmm.

The formalism of p-tests, and even the seemingly antithetical view expressed in the OP, has a whiff about it of white lab coats and Latin inscriptions framed on the wall. There is an idea that formalism for its own sake will redeem our efforts in circumstances where that is highly unlikely.

I do not object to fishing in hope of catching a big one. I would not rule fishing expeditions out of bounds for scientists. In a casino, play till you win and are ahead, then stop, and hedge that strategy with a rule that says: stop also if you lose your limit. That is a good strategy if you want to have a little fun at a casino. You don’t learn anything about the casino doing that. It would be a lousy way to go about testing the integrity of the casino’s promised odds. And the normal operation of a roulette wheel or a game of blackjack is not interesting.
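A tiny simulation of that strategy (a sketch with an arbitrary one-unit winning target and a 20-unit loss limit, assuming an even-money American roulette bet) makes the point: most sessions end with a small win, the average result is still negative, and nothing in the outcomes tests whether the casino honours its promised odds.

```python
# Sketch of the 'quit when ahead, cap your losses' casino strategy.
# Even-money bet on American roulette: win probability 18/38 per spin.
import random

def play_session(rng, p_win=18/38, target=1, loss_limit=20):
    """One visit: bet 1 unit per spin, quit when ahead by `target`
    or down by `loss_limit`."""
    bankroll = 0
    while -loss_limit < bankroll < target:
        bankroll += 1 if rng.random() < p_win else -1
    return bankroll

rng = random.Random(42)
results = [play_session(rng) for _ in range(100_000)]
ahead = sum(r > 0 for r in results)
print(f"sessions ending ahead: {ahead / len(results):.1%}")
print(f"average result per session: {sum(results) / len(results):+.2f} units")
# Roughly 9 sessions in 10 finish one unit ahead, yet the average session loses
# about 1.3 units: optional stopping flatters the record without probing the odds.
```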

The ordinary practice of interpretation, and of testing interpretation by extension, deserves more attention. That is what makes replication interesting: the small variations in approach and point of view that allow knowledge to escape subjective dominance. And maybe to escape, too, post hoc, ergo propter hoc.

62

Peter T 03.26.16 at 5:52 am

How far does the low rate of replicability of findings reflect the low replicability of the real world? Faustusnotes’ side point about the Australian heroin shortage illustrates this: law enforcement had a very marked effect on heroin usage, which saved some thousands of lives. But it was a particular conjunction of circumstances that allowed it to do so. The analysis was illuminating, but there were no easily transferable lessons for other places or even for other drugs in Australia.

63

sanbikinoraion 03.27.16 at 8:30 pm

Someone needs to start up a Journal Of Reproducibility, which *only* carries papers that verify other papers…

64

PGD 03.28.16 at 10:02 am

Paul Reber @ 53 — re your claim that no ‘large literature’ has collapsed — what about the literature on ego depletion which is now under attack? See:

http://www.slate.com/articles/health_and_science/cover_story/2016/03/ego_depletion_an_influential_theory_in_psychology_may_have_just_been_debunked.html

I think Daniel’s post is fantastic and very important.

65

PGD 03.28.16 at 10:07 am

As an example of a potential failure of ‘evidence-based policy making’: there is increasing evidence that mandatory pre-K education has negative effects on many children. Yet the evidence base for ‘early intervention’ was/is seen as so strong that, as a principle, it has practically reached the status of a popular religion.

A fundamental issue is that the generation and interpretation of evidence in the social ‘sciences’ turns out to be heavily ideological and fad-driven, yet the ‘evidence-based’ movement is touted as a way to circumvent ideology and instinctual response. The effect is to shut off many of the benefits that could come from a back-and-forth weighing of the evidence among people who frankly admit and describe their various prejudices.

66

dporpentine 03.28.16 at 10:52 am

This seems like a very UK-specific view of evidence-based policy.

In the US, at least in the social services sector, most policy is the result of layers of statutes and regulations that control it so utterly that there’s only tiny room for movement. Add to that the partisan divide and you’ve basically got no opportunity to move.

I mean, just look at what MDRC does in their randomized controlled trials: they’re forced to fuss around the fringes of policy, using very basic behavioral-economic nudges that don’t hurt but may not do very much good.

Public policy in the US is (a) mostly terrible – like crisis-level terrible, causing a lot of unnecessary suffering – and not incidentally (b) driven primarily by racism and misogyny. “Evidence-based” may not be a panacea, and it may be subject to abuse, but it gives a tiny opening for reason in a debate that here in the US is primarily driven by the crazed resentment of the Limbaughs of this world.

67

RNB 03.28.16 at 11:55 pm

Had been in Japan, so have not followed the discussion carefully. Here’s an example of an attempt to figure out a policy package on the basis of evidence. One can smell the whiff of doom and desperation in the air, though, as Krugman discusses stabilization policy with Prime Minister Abe and Minister of Finance Aso. Remarkable discussion:
https://www.gc.cuny.edu/CUNY_GC/media/LISCenter/pkrugman/Meeting-minutes-Krugman.pdf

68

RNB 03.29.16 at 12:04 am

In terms of the OP, it’s remarkable to me how much work FDR’s step back from fiscal stimulus in 1937 is made to do. It is the biggest piece of evidence that fiscal policy works, and it is negative evidence: the bad results that followed from withdrawing stimulative fiscal policy. In other words, the biggest piece of evidence is a counterfactual. Seems like a very thin evidentiary base.

69

harry b 03.29.16 at 12:38 am

Brett — it’s much more complicated than that. The evidence as I understand it is that effects on test scores fade fast, as you suggest, but that there are then benefits for such things as high school graduation, non-involvement in the criminal justice system, college matriculation, etc. Not large benefits, but well worth the expense. Remember that many of the students in the study you link to who were not in Head Start received other kinds of intervention, largely government-subsidized. It’s an evaluation of Head Start, not of early childhood interventions versus nothing at all.

70

RNB 03.29.16 at 2:30 am

@72 Yes, that is my understanding too of James Heckman’s findings, though I suspect harry b, who I believe is one of Heckman’s respondents in Giving Kids a Fair Chance, knows quite a bit more about this than I do. Some programs do yield statistically significant results, difficult as those are to get with small sample sizes. And it is more than just that: Heckman also argues that the cost-benefit analysis survives a sensitivity analysis, giving us good reason to believe that in well-designed programs the benefits more than justify their costs. It’s no argument against the effectiveness of early childhood education to say that in big, albeit poorly designed, programs the benefits do not justify the costs. Moreover, the control groups in big-sample studies often also had access to early childhood education.

Comments on this entry are closed.