I was a little puzzled by something Kos said in discussing the latest polling from New Hampshire. The poll has Dean at 28% and Kerry at 21%, among a sample of 600 voters. The poll officially has a margin of error of 4%, so Kos was unwilling to call it a clear lead for Dean. This policy strikes me as rather conservative.

One might try reasoning as follows. It’s conceivable, given the poll numbers and the MOE, that Dean is as low as 24%. And it’s conceivable, given the poll numbers and the MOE, that Kerry is as high as 25%. So it’s conceivable, given the poll numbers and the MOE, that Kerry is above Dean.

I think this really is how poll readers often reason, and it’s clearly invalid. What would justify the last step is its being conceivable that Dean is at 24% while Kerry is at 25%. But that doesn’t follow from the data we have. From knowing two things are individually conceivable, it doesn’t follow that it’s conceivable they are true together. Example: given what I know about tomorrow’s weather (next to nothing) it’s conceivable that it will rain, and it’s conceivable that there won’t be a cloud in the sky all day. But it’s not conceivable, even given my puny knowledge, that it will rain while there isn’t a cloud in the sky all day.

Now there aren’t many analogies between the poll reader’s argument and my weather argument, except their common logical form, so one may wonder if there is a better way to justify the conclusion that we can’t know Dean is ahead of Kerry. Here is the best test I could come up with. It’s pretty crude, and I’d be interested in knowing whether there’s something with a greater theoretical justification that produces more plausible results.

I tried to see how probable it was that Dean would have a 7 point lead over Kerry given (a) a poll like this one, with 600 randomly chosen voters, and (b) the assumption that they are tied at around 25%. If I’ve run the simulations correctly, the probability of this is around 0.8%, or 0.008. Now comes the dubious step. (This step is known in some circles as the Prosecutor’s Fallacy.) Since the probability of the results given a tie between Dean and Kerry is 0.008, we’ll infer that the probability of Dean and Kerry being in a tie is 0.008. There’s really no theoretical support for that move, but it is hard to see how to get usable information from polls without doing something like that. So we conclude it is really very unlikely that Dean is not ahead of Kerry.
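One way such a simulation might be set up is sketched below. It is not necessarily the procedure behind the 0.008 figure. It assumes a simple random sample of 600 respondents, each naming exactly one candidate, with Dean and Kerry both truly at 24.5%; the function name, the trial count and the seed are illustrative choices.

```python
import random

def prob_lead_given_tie(n_voters=600, share=0.245, lead_points=7,
                        trials=10_000, seed=0):
    """Estimate P(Dean leads by >= lead_points) when both truly have `share`."""
    rng = random.Random(seed)
    # Each respondent adds +1 (Dean), -1 (Kerry) or 0 (anyone else) to the margin.
    outcomes = (1, -1, 0)
    weights = (share, share, 1 - 2 * share)
    hits = 0
    for _ in range(trials):
        margin = sum(rng.choices(outcomes, weights=weights, k=n_voters))
        # Integer comparison avoids floating-point trouble: 7 points of 600 is 42.
        if margin * 100 >= lead_points * n_voters:
            hits += 1
    return hits / trials

estimate = prob_lead_given_tie()
print(f"P(lead >= 7 points | true tie) is roughly {estimate:.4f}")
```

Note that this only estimates the probability of the data given a tie. It says nothing by itself about the probability of a tie given the data, which is exactly the gap the dubious step papers over.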

Is there a better way to get a usable number from the data? If not, is there any way to justify the last step, using perhaps some kind of independently justifiable priors?
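One candidate answer, offered only as a sketch, is to put a prior on the true shares and compute a posterior probability that Dean is ahead. The version below assumes a uniform Dirichlet prior over (Dean, Kerry, everyone else), a simple random sample, and raw counts of 168 and 126 out of 600 (that is, exactly 28% and 21%); all of these are assumptions, as are the function and parameter names.

```python
import random

def prob_dean_ahead(dean=168, kerry=126, prior=1.0, draws=20_000, seed=0):
    """Posterior P(Dean's share > Kerry's share) under a Dirichlet prior."""
    rng = random.Random(seed)
    ahead = 0
    for _ in range(draws):
        # A Dirichlet draw is a set of independent Gamma draws divided by
        # their sum. Dividing by the sum does not change which share is
        # larger, so comparing the two raw Gamma draws is enough.
        g_dean = rng.gammavariate(dean + prior, 1.0)
        g_kerry = rng.gammavariate(kerry + prior, 1.0)
        if g_dean > g_kerry:
            ahead += 1
    return ahead / draws

posterior = prob_dean_ahead()
print(f"P(Dean ahead | poll, uniform prior) is roughly {posterior:.3f}")
```

With a prior this flat the answer is driven almost entirely by the data, so whether it counts as an independently justified prior is still open. A prior built from earlier polls would be the natural alternative.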


kokomo 08.20.03 at 1:45 am

What Kos actually wrote is: “While still within the +/- 4% MOE (barely), Dean has taken the lead in the latest ARG poll out of New Hampshire.”

He’s less puzzled than you, I think.

J. Michael Neal 08.20.03 at 1:50 am

I suppose that the correct answer would have to deal with the fact that Dean’s support being overestimated implies that other candidates’ support was underrepresented. This would increase the number of distributions in which Kerry’s support would be found to be equal to Dean’s, so 0.8% is undoubtedly low. If it were a binary choice (which it isn’t), Dean being overestimated would necessitate that Kerry was underestimated by the same amount.

Nevertheless, I’d eyeball the numbers as indicating that there is well over a 95% chance that Dean has more support than Kerry. I also think that, six months out from the primary, the difference is so small that it doesn’t matter much.

J. Michael Neal 08.20.03 at 1:53 am

Kokomo,

No, I don’t think that Dean and Kerry being tied actually is within the 95% confidence interval. Either Dean being at 24% *or* Kerry being at 25% is, but not both. This is a case where the very sloppy layman’s use of “margin of error” is incorrect.

Brian Weatherson 08.20.03 at 2:23 am

If I ran the simulation correctly, it should have taken into account the fact that it’s more probable that Kerry’s vote is under-reported conditional on Dean’s vote being over-reported. Indeed, if I just multiply the probabilities of Dean getting as high as 28 by that of Kerry getting as low as 21 (all conditional on them both really being at 24.5), the result is under 0.1%.
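The back-of-the-envelope product mentioned here can be checked with exact binomial tails. This is a sketch under the stated assumptions only: 600 respondents, both candidates truly at 24.5%, 28% and 21% read as 168 and 126 respondents, and the two tails treated as if independent (which they are not).

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 600, 0.245
dean_high = sum(binom_pmf(k, n, p) for k in range(168, n + 1))  # Dean at 28% or more
kerry_low = sum(binom_pmf(k, n, p) for k in range(0, 127))      # Kerry at 21% or less
product = dean_high * kerry_low

print(f"P(Dean >= 28%)  = {dean_high:.4f}")
print(f"P(Kerry <= 21%) = {kerry_low:.4f}")
print(f"product         = {product:.5f}")
```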

I agree entirely that this isn’t very meaningful 6 months out. I’m just interested in the theoretical question because it’s one that arises fairly frequently, and this looked to be a pretty extreme case.

J. Michael Neal 08.20.03 at 6:08 am

Then I believe that you have exceeded my statistical competence. I’ll get back to it when I have my degree in a couple of years.

Amit Dubey 08.20.03 at 1:27 pm

Hi,

You should not do this using a simulation. The probability you got was too low because you also have to simulate all other combinations of them being tied, or Kerry beating Dean, then take the integral. (This is the last step you were missing).

What you want to do is to set up a decision rule testing if one mean really is bigger than the other, and then test the hypotheses. Most introductory social science statistics texts should cover this.

Doug Turnbull 08.20.03 at 1:49 pm

Agree with the last post that you need to integrate your likelihood function. Plus, in some cases the likelihood function doesn’t sum to 100% (not sure if this is such a case), so you’d want to do the simulation for each possible result and then normalize to that value, which is a lot of work.

The other thing that I wonder about is whether your simulation would give you a margin of error of 4%, or whether your assumptions give you a smaller margin than that–it’s possible there are other systemic errors in the polling that increase the error margin above a true random sample.

Trying another tack, using the 4% figure and assuming it’s a sigma value (don’t know how they define it), and assuming statistical independence of the Dean and Kerry numbers (certainly not true), then you get a 1/6 probability that Dean’s numbers are 24% or below, and a 1/6 chance that Kerry’s numbers are 25% or above. So you have a roughly 1/36 chance that both are true.
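Those fractions can be reproduced with a normal distribution. The sketch below takes the same two assumptions as given (4 points is one sigma, and the two numbers are independent) and also computes, under those same assumptions, the chance that Kerry is ahead by any route, not just the one where Dean is at or below 24 and Kerry at or above 25.

```python
from statistics import NormalDist

SIGMA = 4.0  # assumption: the quoted 4% is a one-sigma value

dean = NormalDist(mu=28.0, sigma=SIGMA)
kerry = NormalDist(mu=21.0, sigma=SIGMA)

p_dean_low = dean.cdf(24.0)          # Dean truly at 24% or below
p_kerry_high = 1 - kerry.cdf(25.0)   # Kerry truly at 25% or above
p_both = p_dean_low * p_kerry_high   # valid only under independence

# Under independence the difference Dean - Kerry is normal with
# mean 7 and standard deviation SIGMA * sqrt(2).
diff = NormalDist(mu=7.0, sigma=SIGMA * 2 ** 0.5)
p_kerry_ahead = diff.cdf(0.0)

print(f"{p_dean_low:.3f} {p_kerry_high:.3f} {p_both:.3f} {p_kerry_ahead:.3f}")
```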

Anyway, I agree with your underlying point that most people take margins of error and assume that they mean any number from the measured value +/- the MOE is equally likely, which is not how statistics or measurements work. It always bugs me when people bring out the “statistically tied” verbiage, or some such, since it’s just not true.

Jeff Johnson 08.20.03 at 2:09 pm

The probability of any particular poll result is very low, given any assumption. This is not how you want to think of the results.

Suppose that the confidence level for the poll is 95%, which is fairly standard and seems to be compatible with the sample size and margin of error. Now, in response to J. Michael Neal, when you’re estimating the difference between two dependent variables, such as Dean’s and Kerry’s support, the margin of error for the difference is twice the margin of error for the individual variables, so a statistical tie would be within the confidence interval, because the margin of error for the difference would be +/- 8%.

What a 95% confidence level means is that if you did 100 polls with the same sample size, 95 of the polls would give results within the margin of error of the actual number in the target population. 5 of the polls, however, would give results which are not within the margin of error of the actual number in the target population. In other words, 5% of the time the polls are going to be dead wrong, even given the margin of error.

Thus, as I think Amit Dubey was suggesting, in order to calculate the probability that Dean is not leading Kerry, you have to take into account, among other things, the possibility that the actual numbers are, for example, Kerry 75% and Dean 3%.

Jeff Johnson 08.20.03 at 3:00 pm

I found a z-table and did a few calculations. Suppose we take the margin of error for Dean and Kerry’s poll numbers to be +/-3% instead of 4%. Since Dean got 28% and Kerry 21%, the difference here is d = 7%. The margin of error for the difference would now be +/-6%. Given our new margin of error and a sample size of 600, the confidence level would be about 85% instead of 95%.

Thus, we might say that there’s a 85% probability that 1%

Jeff Johnson 08.20.03 at 3:05 pm

The end of my post disappeared. It seems that the blogger doesn’t like the less-than sign. Anyway, I meant to say that there’s an 85% probability that d is between 1% and 13%.

Jeff Johnson 08.20.03 at 3:48 pm

Ooops, my explanation of confidence level was misleading. It’s not necessarily true that exactly 5 out of every 100 polls will be inaccurate at 95% confidence. That’s only in the limit.

Tim Lambert 08.20.03 at 5:54 pm

I don’t think you can work out the answer unless you know to what extent Dean and Kerry are competing for the same supporters.

If total support for Dean and Kerry is fixed at 49% so that any increase for Dean is matched by a decrease for Kerry, then the 95% confidence interval for the difference is +/- 8% so that a 7% difference is not significant.

On the other hand if they are not competing for the same voters (so that half the people will never vote for Kerry and the other half will never vote for Dean) then changes are independent and the 95% confidence interval for the difference is +/- 4sqrt(2) = +/- 5.6% and the difference is significant.

Reality is going to be in between these two cases, so the answer is “it depends”.
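The two bounding cases, plus one in-between case, can be put side by side. The third figure below is a sketch that assumes a simple random sample of 600 in which each respondent names exactly one candidate (a multinomial model); under that assumption the variance of the difference is (p_dean + p_kerry - (p_dean - p_kerry)^2) / n. The variable names are illustrative.

```python
from math import sqrt

Z95 = 1.96                    # two-sided 95% normal quantile
n = 600
p_dean, p_kerry = 0.28, 0.21
moe_single = 4.0              # reported margin of error, in points

# Case 1: Dean + Kerry fixed, so every gain for one is a loss for the other.
moe_fixed_total = 2 * moe_single

# Case 2: the two estimates vary independently.
moe_independent = moe_single * sqrt(2)

# In between: one multinomial sample, each respondent picks one candidate.
var_diff = (p_dean + p_kerry - (p_dean - p_kerry) ** 2) / n
moe_multinomial = 100 * Z95 * sqrt(var_diff)

print(f"fixed total: +/- {moe_fixed_total:.1f}")
print(f"independent: +/- {moe_independent:.1f}")
print(f"multinomial: +/- {moe_multinomial:.1f}")
```

On the multinomial assumption the 7-point gap falls outside the interval for the difference, though systematic polling errors are not captured by any of these figures.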

Thomas Dent 08.20.03 at 6:56 pm

What Tim said. Maybe Kos got into the habit of thinking that the MOE means subtracting from one guy and adding to the other from looking at two-horse races. If we can assume that the distribution of ‘undecided’s is narrowly peaked and their number is uncorrelated with either of the two candidates then going to the MOE +4 for Kennedy means -4 for Nixon and vice versa.

This is a rather tricky point since what one should be talking about strictly is a probability distribution over the entire multidimensional space of possible results adding up to 100%. Inevitably it doesn’t always make sense when you try to summarize it in a single MOE. Truman vs. Dewey vs. Thurmond was probably a case where quoting a single MOE would be misleading if you wanted to find the likelihood of the actual numbers being off by a certain number of points.

And then you have the problem of Clark (1 percent) – with the MOE being +/-4, this should mean that there is a large probability of Clark’s actual percentage being negative! This piece of nonsense comes about because MOE assumes that the distributions are Gaussian, but they can’t be because the Gaussian extends from minus infinity to plus infinity whereas the percentage result is strictly between 0 and 100.

And then you have the fact that MOE represents only the statistical random error, and you still have to contend with systematic biases, for example Dean supporters being more likely to agree to answer the poll because of a peculiar character trait that they are more likely to possess…

If another poll with different methods comes out with similar numbers it will be much more clear that Dean has a lead.
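The Clark problem can be made concrete. The sketch below assumes a simple random sample of 600 and that the quoted 4% is the worst-case margin computed at 50% support (it matches that figure numerically, but the pollster’s definition is not given here). It compares that headline interval with a Wilson score interval, a standard alternative for proportions near 0 that never leaves the 0 to 100 range; this is a swapped-in technique, not anything the poll itself reports.

```python
from math import sqrt

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a proportion; stays inside [0, 1]."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

n = 600
clark = 0.01

# Headline margin: the normal-approximation margin at 50% support.
worst_case_moe = 1.96 * sqrt(0.25 / n)
naive = (clark - worst_case_moe, clark + worst_case_moe)
wilson = wilson_interval(clark, n)

print(f"headline MOE:   +/- {100 * worst_case_moe:.1f} points")
print(f"naive interval:  {100 * naive[0]:.1f}% to {100 * naive[1]:.1f}%")
print(f"Wilson interval: {100 * wilson[0]:.2f}% to {100 * wilson[1]:.2f}%")
```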

pathos 08.20.03 at 8:24 pm

I am surprised people are still doing phone polls.

I, and many other people, many of whom live in Vermont, have Call-Intercept or Call-Blocking or some such feature that systematically skews who is called in phone surveys. I am guessing that the more right-wing you are, the more likely you are to block/screen your calls.

This is a new phenomenon, but it explains why the Republicans did so well in 2002, despite all polls showing that it would be much closer.

I no longer put any faith in polls conducted over the telephone. Might as well be an internet poll.

kokomo 08.20.03 at 8:52 pm

Given the data, there is a small chance that Dean and Kerry are tied, but a reasonable interpretation is that Dean is in the lead. This is what Kos communicated. The problem is an interesting one, but Kos’ statement is not a proper subject for the discussion.

bigring55t 08.22.03 at 3:35 am

Actually, despite all the math, the solution lies in the realm of psychology. Kos works for the Dean campaign, thus the careful wording is simply a troll prophylactic (borrowed from Atrios) meant to head off pointless accusations of unfairness.

claxton6 08.24.03 at 6:10 pm

>I, and many other people, many of whom live in Vermont, have Call-Intercept or Call-Blocking or some such feature that systematically skews who is called in phone surveys.

My experience with telephone surveys is largely in Nevada, which may be a little different from Vermont, but we only saw a very small number of households with Call-Intercept or Call-Blocking, and even among those households it was possible to get through to a household member.

Of course, I think that presumes that you have a live person doing the calling, since you have to state who’s calling. I think political polls do this, rather than automated dialling like telemarketers, but I don’t know for sure.
