Well, the Lancet study has been out for a while now, and it seems as good a time as any to take stock of the state of the debate and wrap up a few comments which have hitherto been buried in comments threads. Lots of heavy lifting here has been done by Tim Lambert and Chris Lightfoot; I thoroughly recommend both posts, and while I’m recommending things, I also recommend a short statistics course as a useful way to spend one’s evenings (sorry); it really is satisfying to be able to take part in these debates as a participant and, I would imagine, pretty embarrassing and frustrating not to be able to. As Tim Lambert commented, this study has been “like flypaper for innumerates”; people have been lining up to take a pop at it despite being manifestly not in possession of the baseline level of knowledge needed to understand what they’re talking about. (Being slightly more cynical, I suggested to Tim that it was more like “litmus paper for hacks”; it’s up to each individual to decide for themselves whether they think a particular argument is an innocent mistake or not.) Below the fold, I summarise the various lines of criticism and whether they’re valid or (mostly) not.
Starting with what I will describe as “Hack critiques”, without prejudice that they might in isolated individual cases be innocent mistakes. These are arguments which are purely and simply wrong and should not be made because they are, quite simply, slanders on the integrity of the scientists who wrote the paper. I’ll start with the most widespread one.
The Kaplan “dartboard” confidence interval critique
I think I pretty much slaughtered this one in my original Lancet post, but it still spread; apparently not everybody reads CT (bastards). To recap; Fred Kaplan of Slate suggested that because the confidence interval was very wide, the Lancet paper was worthless and we should believe something else like the IBC total.
This argument is wrong for three reasons.
1) The confidence interval describes a range of values which are “consistent” with the model. But it doesn’t mean that all values within the confidence interval are equally likely, so that you can just pick whichever one you like. In particular, the most likely values are the ones in the centre of a symmetrical confidence interval. The single most likely value is, in fact, the central estimate of 98,000 excess deaths. Furthermore, as I pointed out in my original CT post, the truly shocking thing is that, wide as the confidence interval is, it does not include zero. You would expect to get a sample like this fewer than 2.5 times out of a hundred if the true number of excess deaths was less than zero (that is, if the war had made things better rather than worse).
2) As the authors themselves pointed out in correspondence with the management of Lenin’s Tomb,
“Research is more than summarizing data, it is also interpretation. If we had just visited the 32 neighborhoods without Falluja and did not look at the data or think about them, we would have reported 98,000 deaths, and said the measure was so imprecise that there was a 2.5% chance that there had been less than 8,000 deaths, a 10% chance that there had been less than about 45,000 deaths,….all of those assumptions that go with normal distributions. But we had two other pieces of information. First, violence accounted for only 2% of deaths before the war and was the main cause of death after the invasion. That is something new, consistent with the dramatic rise in mortality and reduces the likelihood that the true number was at the lower end of the confidence range. Secondly, there is the Falluja data, which imply that there are pockets of Anbar, or other communities like Falluja, experiencing intense conflict, that have far more deaths than the rest of the country. We set that aside these data in statistical analysis because the result in this cluster was such an outlier, but it tells us that the true death toll is far more likely to be on the high-side of our point estimate than on the low side.”
That is, the sample contains important information which is not summarised in the confidence interval, but which tells you that the central estimate is not likely to be a massive overestimate. The idea that the central 98,000 number might be an underestimate seemed to have blown the mind of a lot of commentators; they all just seemed to act like it Did Not Compute.
3) This gave rise to what might be called the use of “asymmetric rhetoric about a symmetric confidence interval”, but which I will give the more catchy name of “Kaplan’s Fallacy”. If your critique of an estimate is that the range is too wide, then that is one critique you can make. However, if this is all you are saying (“this isn’t an estimate, it’s a dartboard”), then intellectual honesty demands that you refer to the whole range when using this critique, not just the half of it that you want to think about. In other words, it is dishonest to title your essay “100,000 dead – or 8,000?” when all you actually have arguments to support is “100,000 dead – or 8,000 – or 194,000?”. This is actually quite a common way to mislead with statistics; say in paragraph 1 “it could be more, it could be less” and then talk for the rest of the piece as if you’ve established “it’s probably less”.
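For anyone who wants to put rough numbers on point 1), here is a minimal sketch in Python. It rests on an assumption of mine, not on the paper's own procedure: it treats the estimate as normally distributed on the death-count scale and backs a standard error out of the quoted 8,000 to 194,000 interval (which is only roughly symmetric about 98,000). The results are therefore approximate and need not match the authors' own figures exactly.

```python
from statistics import NormalDist

# Figures quoted in the post: central estimate and 95% interval.
CENTRAL = 98_000
LOWER, UPPER = 8_000, 194_000

# ASSUMPTION (illustrative only): a normal distribution on the death-count
# scale, with the 95% interval spanning about 1.96 standard errors each side.
Z95 = NormalDist().inv_cdf(0.975)
std_error = (UPPER - LOWER) / (2 * Z95)
estimate = NormalDist(mu=CENTRAL, sigma=std_error)

# How much more plausible is the centre than the lower edge of the interval?
density_ratio = estimate.pdf(CENTRAL) / estimate.pdf(LOWER)

# Tail areas under the same approximation.
p_below_lower = estimate.cdf(LOWER)
p_below_zero = estimate.cdf(0)
p_above_upper = 1 - estimate.cdf(UPPER)

print(f"approximate standard error: {std_error:,.0f}")
print(f"density at centre / density at lower edge: {density_ratio:.1f}")
print(f"P(fewer than {LOWER:,}): {p_below_lower:.3f}")
print(f"P(fewer than zero): {p_below_zero:.3f}")
print(f"P(more than {UPPER:,}): {p_above_upper:.3f}")
```

The point of the density ratio is simply that the interval is not a dartboard: under this approximation the middle is several times more plausible than either edge, and the upper tail is about as heavy as the lower one.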
The Kaplan piece was really very bad; as well as the confidence interval fallacy, there are the germs of several of the other fallacious arguments discussed below. It really looks to me as if Kaplan had decided he didn’t want to believe the Lancet number and so started looking around for ways to rubbish it, in the erroneous belief that this would make him look hard-headed and scientific and would add credibility to his endorsement of the IBC number. I would hazard a guess that anyone looking for more Real Problems For The Left would do well to lift their head up from the Bible for a few seconds and ponder what strange misplaced and hypertrophied sense of intellectual charity it was that made Kaplan, an antiwar Democrat, decide to engage in hackish critiques of a piece of good science that supported his point of view.
The cluster sampling critique
There are shreds of this in the Kaplan article, but it reached its fullest and most widely-cited form in a version by Shannon Love on the Chicago Boyz website. The idea here is that the cluster sampling methodology used by the Lancet team (for reasons of economy, and of reducing the very significant personal risks for the field team) reduces the power of the statistical tests and makes the results harder to interpret. It was backed up (wayyyyy down in comments threads) by people who had gained access to a textbook on survey design; most good textbooks on the subject do indeed suggest that it is not a good idea to use cluster sampling when one is trying to measure rare effects (like violent death) in a population which has been exposed to heterogeneous risks of those rare events (i.e., some places were bombed a lot, some a little and some not at all).
There are two big problems with the cluster sampling critique, and I think that they are both so serious that this argument is now a true litmus test for hacks; anyone repeating it either does not understand what they are saying (in which case they shouldn’t be making the critique) or does understand cluster sampling and thus knows that the argument is fallacious. The problems are:
1) Although sampling textbooks warn against the cluster methodology in cases like this, they are very clear about the fact that the reason why it is risky is that it carries a very significant danger of underestimating the rare effects, not overestimating them. This can be seen with a simple intuitive illustration; imagine that you have been given the job of checking out a suspected minefield by throwing rocks into it.
This is roughly equivalent to cluster sampling a heterogeneous population; the dangerous bits are a fairly small proportion of the total field, and they’re clumped together (the mines). Furthermore, the stones that you’re throwing (your “clusters”) only sample a small bit of the field at a time. The larger each individual stone, the better, obviously, but equally obviously it’s the number of stones that you have that is really going to drive the precision of your estimate, not their size. So, let’s say that you chuck 33 stones into the field. There are three things that could happen:
a) By bad luck, all of your stones could land in the spaces between mines. This would cause you to conclude that the field was safer than it actually was.
b) By good luck, you could get a situation where most of your stones fell in the spaces between mines, but some of them hit mines. This would give you an estimate that was about right regarding the danger of the field.
c) By extraordinary chance, every single one of your stones (or a large proportion of them) might chance to hit mines, causing you to conclude that the field was much more dangerous than it actually was.
How likely is the third of these possibilities (analogous to an overestimate of the excess deaths) relative to the other two? Not very likely at all. Cluster sampling tends to underestimate rare effects, not overestimate them.
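The three outcomes above can be given rough probabilities with a toy calculation. This is a sketch under assumptions of my own: the mined share of the field (1%) is a made-up number, each stone is treated as an independent point sample, and "a large proportion" is arbitrarily read as a third or more of the stones.

```python
from math import comb

STONES = 33          # number of stones thrown, as in the illustration
MINED_SHARE = 0.01   # HYPOTHETICAL: 1% of the field is mined

def prob_hits(k, n=STONES, p=MINED_SHARE):
    """Binomial probability that exactly k of n independent stones hit a mine."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Outcome (a): every stone lands between the mines.
p_all_miss = prob_hits(0)

# Outcome (c): a third or more of the stones hit mines (arbitrary threshold).
threshold = STONES // 3
p_third_or_more = sum(prob_hits(k) for k in range(threshold, STONES + 1))

# Outcome (b): everything in between.
p_some_hits = 1 - p_all_miss - p_third_or_more

print(f"(a) no stone hits a mine:       {p_all_miss:.3f}")
print(f"(b) a few stones hit mines:     {p_some_hits:.3f}")
print(f"(c) a third or more hit mines:  {p_third_or_more:.2e}")
```

With mines this rare, missing them altogether is the single most likely outcome, while the wild overestimate in (c) is vanishingly improbable. Raising MINED_SHARE shifts the balance, which is the pedant's objection, but it takes a very heavily mined field before (c) becomes a serious possibility.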
And 2), this problem, and other issues with cluster sampling (basically, it reduces your effective sample size to something closer to the number of clusters than the number of individuals sampled) are dealt with at length in the sampling literature. Cluster sampling ain’t ideal, but needs must and it is frequently used in bog-standard epidemiological surveys outside war zones. The effects of clustering on standard results of sampling theory are known, and there are standard pieces of software that can be used to adjust (widen) one’s confidence interval to take account of these design effects. The Lancet team used one of these procedures, which is why their confidence intervals are so wide (although, to repeat, not wide enough to include zero). I have not seen anybody making the clustering critique who has any argument at all from theory or data which might give a reason to believe that the normal procedures are wrong for use in this case. As Richard Garfield, one of the authors, said in a press interview, epidemics are often pretty heterogeneously distributed too.
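To give a flavour of what such an adjustment does, here is a sketch using the textbook design-effect formula, deff = 1 + (m - 1) * rho, where m is the cluster size and rho the intra-cluster correlation. This is not a reconstruction of the procedure the Lancet team actually ran; the cluster size of 30 households and the correlation values are hypothetical numbers chosen for illustration.

```python
from math import sqrt

def design_effect(cluster_size, icc):
    """Textbook design effect: 1 + (m - 1) * rho."""
    return 1 + (cluster_size - 1) * icc

def effective_sample_size(n_clusters, cluster_size, icc):
    """Nominal sample size deflated by the design effect."""
    return n_clusters * cluster_size / design_effect(cluster_size, icc)

def ci_widening_factor(cluster_size, icc):
    """Factor by which a simple-random-sample interval is widened."""
    return sqrt(design_effect(cluster_size, icc))

N_CLUSTERS = 33     # matches the 33 stones in the minefield illustration
CLUSTER_SIZE = 30   # HYPOTHETICAL households per cluster

# HYPOTHETICAL intra-cluster correlations, from none to total.
for icc in (0.0, 0.1, 0.5, 1.0):
    n_eff = effective_sample_size(N_CLUSTERS, CLUSTER_SIZE, icc)
    widen = ci_widening_factor(CLUSTER_SIZE, icc)
    print(f"icc={icc:.1f}  effective n={n_eff:7.1f}  CI widened by x{widen:.2f}")
```

At zero correlation the clustering costs nothing; at total correlation every household in a cluster tells you the same thing and the effective sample size collapses to the number of clusters. Real surveys sit somewhere in between, and the widened interval is the price paid.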
There is a variant of this critique which is darkly hinted at by both Kaplan and Love, but neither of them appears to have the nerve to say it in so many words. This would be the critique that there is something much nastier about the sample; that it is not a random sample, but is cherry-picked in some way. In order to believe this, if you have read the paper, you have to be prepared to accuse the authors of telling a disgusting barefaced lie, and presumably to accept the legal consequences of doing so. They picked the clusters by the use of random numbers selected from a GPS grid. In the few cases in which this was logistically difficult (read: insanely dangerous), they picked locations off a map and walked to the nearest household. There is no realistic way in which a critique of this sort can get off the ground; in any case, the map-based fallback affected only a small minority of clusters.
The argument from the UNICEF infant mortality figures
I think that the source for this is Heiko Gerhauser, in various weblog comments threads, but again it can be traced back to a slightly different argument about death rates in the Kaplan piece. The idea here is that the Lancet study finds a prewar infant mortality rate of 29 per 1000 live births and a postwar infant mortality rate of 54 per 1000 live births. Since the prewar infant mortality rate was estimated by UNICEF to be over 100, this (it is argued) suggests that the study is giving junk numbers and all of its conclusions should be rejected.
This argument was difficult to track down to its lair, but I think we have managed it. One weakness is similar to the point I’ve made above; if you believe that the study has structurally underestimated infant mortality, then isn’t it also likely to have underestimated adult mortality? The authors discuss a few reasons why the movement in infant mortality might be exaggerated (mainly, issues of poor recall by the interview subjects), though, and it is good form to look very closely at any anomalies in data.
Which is what Chris Lightfoot did.
Basically, the UNICEF estimate is quoted as a 2002 number, but it is actually based on detailed, comprehensive, on-the-ground work carried out between 1995 and 1999 and extrapolated forward. The method of extrapolation is not one which would take into account the fact that 1999 was the year in which the oil-for-food program began to have significant effects on child malnutrition in Iraq. No detailed on-the-ground survey has been carried out since 1999, and there is certainly no systematic data-gathering apparatus in Iraq which could give any more solid number. The authors of the study believe that the infant mortality rates in neighbouring countries are a better comparator than pre-oil for food Iraq, and since one of them is Richard Garfield, who was acknowledged as the pre-eminent expert on sanctions-related child deaths in the 1990s, there is no reason to gainsay them.
I’d add to Chris’s work a theory of my own, based on the cluster sampling issue discussed above. Infant mortality is rare, and it is quite possibly heterogeneously clustered in Iraq (not least, post-war, a part of the infant mortality was attributed to babies being born at home because it was too dangerous to go to hospital). So it’s not necessarily the case that one needs to have a special explanation of why infant deaths might have been undersampled in this case. Since this undersampling would tend to underestimate infant mortality both before and after the war, it wouldn’t necessarily bias the estimate of the relative risk ratio and therefore the excess deaths. I’d note that my theory and Chris’s aren’t mutually exclusive; I suspect that his is the main explanation.
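The arithmetic behind that last claim is trivial but worth seeing. The sketch below uses the 29 and 54 per 1000 figures quoted above and assumes, purely for illustration, that the survey captures the same fraction of infant deaths before and after the war; the capture fractions themselves are made up. If the undercount differed between the two periods, the ratio would of course move.

```python
from math import isclose

# Infant mortality per 1000 live births as found by the study (from the post).
PREWAR_RATE = 29
POSTWAR_RATE = 54

def relative_risk(before, after):
    """Ratio of the post-war rate to the pre-war rate."""
    return after / before

observed_rr = relative_risk(PREWAR_RATE, POSTWAR_RATE)
print(f"observed relative risk: {observed_rr:.2f}")

# HYPOTHETICAL capture fractions: the share of true infant deaths the survey
# picks up, assumed identical in both periods.
for capture in (0.3, 0.5, 0.8):
    true_before = PREWAR_RATE / capture
    true_after = POSTWAR_RATE / capture
    implied_rr = relative_risk(true_before, true_after)
    print(f"capture={capture:.1f}  implied true rates "
          f"{true_before:.0f} -> {true_after:.0f}  relative risk {implied_rr:.2f}")
    # The common factor cancels, so the ratio is unchanged.
    assert isclose(implied_rr, observed_rr)
```

The levels move a great deal with the capture fraction; the ratio does not move at all.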
We now move into the area of what might be called “not intrinsically hack” critiques. These are issues which one could raise with respect to the study which are not based on either definite or likely falsehoods and which do not impugn the integrity of the study, but which are not themselves based on evidence strong enough to make anyone believe that the study’s estimates were wrong unless they thought so anyway.
There are two of these that I’ve seen around and about.
The first might be called the “Lying Iraqis” theory. This would be the theory that the interview subjects systematically lied to the survey team. In fact, the team did attempt to check against death certificates in a subsample of the interviews and found that in 81% of cases, subjects could produce them. This would lead me to believe that there is no real reason to suppose that the subjects were lying. Furthermore, I would suspect that if the Iraqis hate us enough to invent deaths of household members to make us look bad in the Lancet, that’s probably a fairly serious problem too. However, the possibility of lying subjects can’t be ruled out in any survey, so it can’t be ruled out in this one, so this critique is not intrinsically hackish. Any attempt to bolster it either with an attack on the integrity of the researchers, or with a suggestion that the researchers mainly interviewed “the resistance” (they didn’t), however, is hack city.
The second, which I haven’t really seen anyone adopt yet, although some people looked like they might, could be called the “Outlier theory”. This is basically the theory that this survey is one gigantic outlier, and that a 2.5% probability event has happened. This would be a fair enough thing to believe, as long as one admitted that one was believing in something quite unlikely, and as long as it wasn’t combined with an attack on the integrity of the Lancet team.
Finally, we come onto two critiques of the study which I would say are valid. The first is the one that I made myself in the original CT post; that the extrapolated number of 98,000 is a poor way to summarise the results of the analysis. I think that the simple fact that we can say with 97.5% confidence that the war has made things worse rather than better is just as powerful and doesn’t commit one to the really quite strong assumptions one would need to make for the extrapolation to be valid.
The second one is one that is attributable to the editors of the Lancet rather than the authors of the study. The Lancet’s editorial comment on the study contained the phrase “100,000 civilian deaths”. The study itself counts excess deaths and does not attempt to classify them as combatants or civilians. The Lancet editors should not have done this, and their denial that they did it to sensationalise the claim ahead of the US elections is unconvincing. This does not, however, affect the science; to claim that it does is the purest imaginable example of argumentum ad hominem.
Finally, beyond the ultra-violet spectrum of critiques are those which I would classify as “beyond hackish”. These are things which anyone who gave them a moment’s thought would realise are irrelevant to the issue.
In this category, but surprisingly and disappointingly common in online critiques, is the attempt to use the IBC numbers as a stick to beat the Lancet study. The two studies are simply not comparable. One final time; the Iraq Body Count is a passive reporting system, which aims to count civilian deaths as a result of violence. Of course it is going to be lower than the Lancet number. Let that please be an end of this.
And there are a number of odds and ends around the web of the sort “each death in this study is being taken to stand for XXYY deaths and that is ridiculous”. In other words, arguments which, if true, would imply that there could be no valid form of epidemiology, econometrics, opinion polling, or indeed pulling up a few spuds to see if your allotment has blight. This truly is flypaper for innumerates.
I would also include in this category attempts like that of the Obsidian Order weblog to chaw down the 98,000 number by making more or less arbitrary assumptions about what proportion of the excess deaths one might be able to call “combatants” and thus people who deserved to die. This is exactly what people accuse the Lancet of doing; it’s skewing a number by means of your own subjective assessment. Not only is there no objective basis for the actual subjective adjustments that people make, but the entire distinction between combatants and civilians is one which does not exist in nature. As a reason for not caring that 98,000 people might have died, because you think most of them were Islamofascists, it just about passes muster. As a criticism of the 98,000 figure, it’s wretched.
Finally, there is the strange world of Michael Fumento, a man who is such a grandiose and unselfconscious hack that he brings a kind of grandeur to the role. I can no more summarise what a class A fool he’s made of himself in these short paragraphs than I could summarise King Lear. Read the posts on Tim’s site and marvel. And if your name is Jamie Doward of the Guardian, have a word with yourself; not only are you citing blogs rather than reading the paper, you’re treating Flack Central Station as a reliable source!
The bottom line is that the Lancet study was a good piece of science, and anyone who says otherwise is lying. Its results (and in particular, its central 98,000 estimate) are not the last word on the subject, but then nothing is in statistics. There is a very real issue here, and any pro-war person who thinks that we went to war to save the Iraqis ought to be thinking very hard about whether we made things worse rather than better (see this from Marc Mulholland, and a very honourable mention for the Economist). It is notable how very few people who have rubbished the Lancet study have shown the slightest interest in getting any more accurate estimates; often you learn a lot about people from observing the way that they protect themselves from news they suspect will disconcert them.
This is not the place for a discussion of Bayesian versus frequentist statistics. Stats teachers will tell you that it is a fallacy and wrong to interpret a confidence interval as meaning that “there is a 95% chance that the true value lies in this range”. However, I would say with 95% confidence that a randomly selected stats teacher would not be able to give you a single example of a case in which someone made a serious practical mistake as a result of this “fallacy”, so I say think about it this way.
Pedants would perhaps object that the more common mines are in the field, the less the tendency to underestimate. Yes, but a) by the time you got to a stage where an overestimate became seriously likely, you would be talking not about a minefield, but a storage yard for mines with a few patches of grass in it and b) we happen to know that violent death in Iraq is still the exception rather than the norm, so this quibble is irrelevant.
And quite rightly so; if said in so many words, this accusation would clearly be defamatory.
That is, they don’t go out looking for deaths like the Lancet did; they wait for someone to report them. Whatever you think about whether there is saturation media coverage of Iraq (personally, I think there is saturation coverage of the green zone of Baghdad and precious little else), this is obviously going to be a lower bound rather than a central estimate, and in the absence of any hard evidence about casualties there is no reason at all to suppose that we have any basis other than convenient subjective air-pulling to adjust the IBC count for how much of an undersample we might want to believe they are making.