Looking at Data

by Kieran Healy on October 13, 2009

Jeremy Freese is doing some analysis:

So, the General Social Survey reinterviewed a large subset of 2006 respondents in 2008. They have released the data that combines into one file the respondents interviewed for the first time in 2008 and the 2008 reinterviews of the respondents originally interviewed in 2006. In a separate file, of course, you can get the original 2006 interviews for the latter people.

What has not yet been released, however, is the variable that would identify what row in the first file corresponds to what row in the second file. In other words, you know that person #438 in the reinterview data is somebody originally interviewed in 2006, but you don’t know which person in the 2006 data they are.

Well, especially because the last thing I need to be doing right now is procrastinating, that sounded like a challenge. Just as I have learned that just because there are no microwave instructions for a frozen dinner doesn’t mean you can’t microwave it, just because there isn’t a merge variable doesn’t mean you can’t merge the data. At least if no secure data agreement is involved.

All I have to say is: holy crap. You’d think knowing somebody’s sex, survey ballot (which was kept the same both times), zodiac sign, year of birth, self-identified race, region where they lived when they were 16, whether they lived with their parents when they were 16, whether they lived in the same place they did growing up, who they said they voted for in 2004, their marital status, their education, what they say they did for a living, how many years their mother went to school, inter alia, would allow you to pretty easily pinpoint who is who. I am here to tell you this is not the case.

I was able to devise some convoluted scheme and check how well it was doing thanks to a pretty big clue that I’ll refrain from posting, but even then there ended up being 50 cases out of 1500 where I wasn’t sure who they were. In general the experience affirmed a fundamental suspicion I’ve had about analyzing survey data: the data seem so much less real once you ask the same person the same question twice.

The real distinction between qualitative and quantitative is not widely appreciated. People think it has something to do with counting versus not counting, but this is a mistake. If the interpretive work necessary to make sense of things is immediately obvious to everyone, it’s qualitative data. If the interpretative work you need to do is immediately obvious only to experts, it’s quantitative data.

{ 14 comments }

1

Ahistoricality 10.13.09 at 1:36 pm

If the interpretive work necessary to make sense of things….

That’s almost funny.

2

Ray 10.13.09 at 2:28 pm

I’m guessing there was a large difference between the answers to this question, “who they said they voted for in 2004” in 2006 and 2008.

Some of the other questions – marital status, employment, if they live in the same place – may have different answers in a later survey, because those things have changed. Zodiac sign and number of years mother went to school may not be remembered well. ‘Self-identified race’ is also obviously susceptible to change.

Including all of those fields is probably making things harder, not easier.

3

Barry 10.13.09 at 2:33 pm

Recall errors would also contaminate things. If you’re matching on 10 remembered data points per person, a recall error rate of less than excellent would really mess you up.
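To put a rough number on that, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each field is reported identically in both waves with the same probability and that errors are independent across fields. The function name and the accuracy values are hypothetical, not estimates from the GSS.

```python
def prob_exact_match(per_field_accuracy: float, n_fields: int = 10) -> float:
    """Chance that all fields agree across both waves.

    Assumption: each field agrees independently with the same probability.
    """
    return per_field_accuracy ** n_fields


if __name__ == "__main__":
    # Illustrative accuracy levels only; not measured values.
    for accuracy in (0.99, 0.95, 0.90):
        share = prob_exact_match(accuracy)
        print(f"{accuracy:.2f} per field: {share:.3f} of true pairs agree on all ten")
```

Under these assumptions, a per-field agreement rate of 0.95 leaves only about 60 percent of true pairs matching on all ten fields, so an exact-match merge would miss roughly four in ten respondents.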

4

Billikin 10.13.09 at 2:44 pm

“The real distinction between qualitative and quantitative is not widely appreciated. People think it has something to do with counting versus not counting, but this is a mistake. If the interpretive work necessary to make sense of things is immediately obvious to everyone, it’s qualitative data. If the interpretative work you need to do is immediately obvious only to experts, it’s quantitative data.”

That is funny. ;)

5

eudoxis 10.13.09 at 4:43 pm

Data that are more “real”, like medical histories, present with the same problem. Merging patient information on a grid is impossible without a unique identifier. Greater accuracy is required, of course, but still, one would think that collecting and merging all the obvious fields would work.

6

Ken Houghton 10.13.09 at 5:57 pm

I fear Ray’s point–and Jeremy’s desire to avoid downtime–will keep them from now cross-referencing the respondents so that datamining can be used to see who l/i/e/d/changed their 2004 voting preference in 2008.

7

Barry 10.13.09 at 6:03 pm

eudoxis 10.13.09 at 4:43 pm

“Data that are more “real”, like medical histories, present with the same problem. Merging patient information on a grid is impossible without a unique identifier. Greater accuracy is required, of course, but still, one would think that collecting and merging all the obvious fields would work.”

Depending on circumstances:

1) Some fields might (will) have missing data for one or the other wave.
2) Enough fields in data set 1 would also have to be in data set 2.
3) As I pointed out above, if there is a set of overlapping fields in the two data sets, but there are errors in recording (recalling) the values, then that set of values will not be the same for a person between the two data sets.
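One way to cope with all three problems is to score candidate pairs on partial agreement rather than demand an exact match. The sketch below is only an illustration of that idea, not the scheme Jeremy used: the field names, the `min_compared` threshold, and both function names are made up for the example. Fields missing in either record are skipped, and the candidate with the highest share of agreeing fields wins.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def agreement_score(a: Record, b: Record, fields: List[str]) -> Tuple[int, int]:
    """Return (agreeing fields, comparable fields) for two records.

    A field is comparable only if it is present (not None) in both records.
    """
    agree = 0
    compared = 0
    for field in fields:
        value_a, value_b = a.get(field), b.get(field)
        if value_a is None or value_b is None:
            continue  # missing in one wave: no evidence either way
        compared += 1
        if value_a == value_b:
            agree += 1
    return agree, compared


def best_match(
    target: Record,
    candidates: List[Record],
    fields: List[str],
    min_compared: int = 3,  # hypothetical threshold
) -> Tuple[Optional[int], float]:
    """Return (index, agreement ratio) of the best-scoring candidate.

    Candidates with fewer than min_compared comparable fields are ignored.
    Ties go to the earliest candidate.
    """
    best_index: Optional[int] = None
    best_ratio = -1.0
    for index, candidate in enumerate(candidates):
        agree, compared = agreement_score(target, candidate, fields)
        if compared < min_compared:
            continue
        ratio = agree / compared
        if ratio > best_ratio:
            best_index, best_ratio = index, ratio
    return best_index, best_ratio
```

A real linkage would also weight fields by how informative and how stable they are (sex tells you less than year of birth, and marital status can legitimately change between waves), which this sketch deliberately leaves out.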

8

frances 10.13.09 at 8:06 pm

This is absolutely fascinating and proves to me [yet again] my ignorance – all I will add is I can never get through identity authentication routines by use of “known facts”. First boyfried, pet, school – whatever it is – I don’t think there is more than one instance in my memory bank but there must be as I always seem to fail the test whether internet or call centre. And I work in id management policy which is the odd thing. (Only item I saw in a suggested list of remembered facts I stood a chance with was mother’s Co-op divi number – but you have to be UK & a certain age and class before that works.)

Human beings always complicate things

9

frances 10.13.09 at 8:07 pm

boyfried? where did that come from

10

Tim B 10.13.09 at 8:28 pm

frances: “…I work in id management policy which is the odd thing.”

Id management policy. That’s something the world needs a lot more of, in my opinion.

11

Phil 10.13.09 at 9:35 pm

I was subjected to a sort of Monty Python catechism the other day. “All right, Sir, just one more thing, could you tell me: Famous place…?” “Purley”, I replied, and he was prepared to process my query (which as it happened was “how can I give you some more money?”, so the ID-verification really was overkill).

What I think must have happened is that he read out the hint by mistake for the prompt – although obviously I’m not going to confide here what Purley is to me or I to Purley. (Not the obvious, I’ll say that.)

12

Jeremy 10.16.09 at 6:36 am

@2 – they don’t actually ask respondents their zodiac sign. They ask them their birthdate, which is then recoded to zodiac sign reflecting the interest some people have in debunking astrology. There were more one-sign differences in zodiac for ostensibly matching respondents than I would have guessed; I suspect this is less about slight misreports of birthday by respondents than about interviewer typing errors, but that’s a suspicion.

@6 – Given that the entire grant justification for doing the repeated interview was the possibility of doing panel analysis, it’s not a real possibility that NORC will not release the panel identifier. Their not having done so already is annoying, though.

13

Alan Peakall 10.16.09 at 12:00 pm

Jeremy,

If, as you say, the zodiac sign is processed data rather than raw data, is it possible that many of the zodiac sign mismatches are a result of the encoding being done for the reporting year, rather than the birth year of the respondent? As the reporting years differ by an odd multiple of two years, the leap year cycle could skew half of those born on a cusp.

14

Jeremy 10.16.09 at 5:28 pm

@13 – I wondered about the cusp issue. My presumption is that, given that I don’t think NORC really takes the ZODIAC variable that seriously, they just had some algorithm that always coded the same month/day of birth the same way, even if this might not be the fully astrologically correct way of doing it.
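For readers wondering what such a fixed recode might look like, here is a minimal sketch. It is not NORC's actual algorithm: the cutoff dates below are one common Western convention, chosen as an assumption for the example, and the function name is hypothetical. The point is only that the same month and day always map to the same sign, regardless of birth year or interview year.

```python
# Start dates (month, day) for each sign, in calendar order.
# Assumption: one widely used convention; the GSS cutoffs may differ.
SIGN_STARTS = [
    (1, 20, "Aquarius"),
    (2, 19, "Pisces"),
    (3, 21, "Aries"),
    (4, 20, "Taurus"),
    (5, 21, "Gemini"),
    (6, 21, "Cancer"),
    (7, 23, "Leo"),
    (8, 23, "Virgo"),
    (9, 23, "Libra"),
    (10, 23, "Scorpio"),
    (11, 22, "Sagittarius"),
    (12, 22, "Capricorn"),
]


def zodiac_sign(month: int, day: int) -> str:
    """Map a month/day of birth to a sign using fixed cutoffs."""
    sign = "Capricorn"  # covers January dates before Aquarius begins
    for start_month, start_day, name in SIGN_STARTS:
        if (month, day) >= (start_month, start_day):
            sign = name  # keep the latest sign whose start has passed
    return sign
```

With cutoffs like these, `zodiac_sign(3, 20)` gives Pisces and `zodiac_sign(3, 21)` gives Aries, so a one-day slip in a recorded birthday near a boundary is enough to produce a one-sign difference between waves.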
