Why You Should Never Trust a Data Scientist

by Henry Farrell on July 18, 2013

“Pete Warden”:http://petewarden.com/2013/07/18/why-you-should-never-trust-a-data-scientist/

bq. The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries. My first taste of this was my Facebook friends connection map. The underlying data was sound, derived from 220m public profiles. The network visualization of drawing lines between the top ten links for each city had issues, but was defensible. The clustering was produced by me squinting at all the lines, coloring in some areas that seemed more connected in a paint program, and picking silly names for the areas. I thought I was publishing an entertaining view of some data I’d extracted, but it was treated like a scientific study. A New York Times columnist used it as evidence that the US was perilously divided. White supremacists dug into the tool to show that Juan was more popular than Juan[HF – John???] in Texan border towns, and so the country was on the verge of being swamped by Hispanics. …

bq. I’ve enjoyed publishing a lot of data-driven stories since then, but I’ve never ceased to be disturbed at how the inclusion of numbers and the mention of large data sets numbs criticism. The articles live in a strange purgatory between journalism, which most readers have a healthy skepticism towards, and science, where we sub-contract verification to other scientists and so trust the public output far more. … If a sociologist tells you that people in Utah only have friends in Utah, you can follow a web of references and peer review to understand if she’s believable. If I, or somebody at a large tech company, tells you the same, there’s no way to check. The source data is proprietary, and in a lot of cases may not even exist any more in the same exact form as databases turn over, and users delete or update their information. Even other data scientists outside the team won’t be able to verify the results. The data scientists I know are honest people, but there’s no external checks in the system to keep them that way.

[via Cosma – Cross-posted at The Monkey Cage]

{ 21 comments }

1

marcel 07.18.13 at 7:10 pm

Why You Should Never Trust a Data Scientist… especially when a data scientist tells you that data scientists are not to be trusted!

2

Metatone 07.18.13 at 8:09 pm

He’s a better person than many to stand up and say this about his own work.

Doubt it will change much of the “journalism” about these kinds of projects.

3

Tom Slee 07.18.13 at 8:56 pm

Pete Warden sounds like an admirable guy, but I’ve never understood how the term “data science” makes any sense at all. We should call it “datanomics” and then everyone would know not to trust it. Or is this a case of John Searle’s aphorism: “anything that calls itself a science probably isn’t”?

4

Eskimo 07.18.13 at 9:03 pm

Agreed. All the plain old scientists I know are very interested in producing and interpreting data. Perhaps a more appropriate term is “non-academic scientist”?

5

John Quiggin 07.18.13 at 9:32 pm

One oddity that now seems to have vanished is “data mining”. This was, and is, used as a term of derision among econometricians, as I mentioned here (https://crookedtimber.org/2004/02/13/data-mining/). John Lott was the all-time expert at this, able to find a statistically significant and politically convenient relationship in any data set he encountered.

For a long time, though, “data mining”, used in a positive sense, was also a buzzword among “data scientists”. It seems now to have been replaced by “Big Data”, which sounds a lot cuter than the econometric equivalent term “large data sets”.

6

Jerry Vinokurov 07.18.13 at 10:02 pm

I don’t know that data mining has really vanished, although maybe its use in the popular press has receded. It was certainly used in reference to the NSA’s PRISM program, and it’s used in machine learning academia all the time.

7

mpowell 07.18.13 at 10:21 pm

After reading the article, I still don’t know what a data scientist is. Is it just someone who takes data sets, runs them through some kind of analytic filter and then publishes the results in non peer-reviewed articles? That’s a pretty broad definition. And why is he so worried about what scientists who publish in peer-reviewed articles are doing? When they find it interesting to examine the data sets Pete Warden is looking at, they do so. They might even publish some peer-reviewed articles based on them. But that won’t do anything to stop people continuing to publish non peer-reviewed work and for the rest of the world to take it more seriously than they should. I don’t see how that’s the fault of ‘real’ scientists.

8

Phil 07.18.13 at 10:33 pm

I wonder if the appeal of “big data” is linked to Chris Anderson’s discovery that theory was dead, because hey, lots of data!, which you could analyse with big computers!! and so you didn’t need to think about the questions you were asking, because, er.

9

Barry 07.18.13 at 11:15 pm

“After reading the article, I still don’t know what a data scientist is. Is it just someone who takes data sets, runs them through some kind of analytic filter and then publishes the results in non peer-reviewed articles? That’s a pretty broad definition.”

Somebody used to dealing with large data sets and machine learning techniques seems to be what they are talking about.

10

Tom Slee 07.18.13 at 11:30 pm

Apologies in advance for telling people things they probably already know and for getting bits wrong, but I think there is a bit of a coherent history to the terms, if not to the outcomes.

“Big Data” is pretty explicitly linked back to the Google File System and Bigtable implementations and to Amazon’s Dynamo database, all of which came out of having to handle data sets spread across enough computers that failure of hard drives and network connections had to be treated as routine. There was academic work before that around the CAP Theorem and I don’t know how these were connected. To me, these were qualitatively different approaches to data management and deserved some kind of new name, but the spread of the term Big Data to encompass anything more than a big spreadsheet is obviously a bit silly.

Likewise, Phil’s comment about Chris Anderson’s Wired article goes back to Google and the Unreasonable Effectiveness of Data (PDF). FWIW I think there is less justification for treating this as a significant new development.

11

John Voorheis 07.19.13 at 12:41 am

I mean, it’s not like, e.g., empirical economics is some paragon of reproducibility or something, right? Many if not most journals don’t require code and data to be submitted with a paper (although this seems to be slowly changing).

12

RSA 07.19.13 at 1:02 am

“For a long time, though, “data mining”, used in a positive sense…”

It still is among computer scientists, though they’re also aware of the negative connotations. (When I was on the periphery of the area some years ago, “data dredging” was an expression for the unambiguously bad practice.) The KDD conference, on knowledge discovery and data mining, still runs annually.

13

Marc 07.19.13 at 1:50 am

Large data sets, and the tools to explore what they mean, are becoming central in a lot of scientific fields. The key to making them useful is twofold: public data and public software. If you can replicate what others have done then the objections here are answered. If anything, this is better than the status quo ante: for example, it used to be very uncommon to release software (largely because scientists are not good programmers, and supporting other people trying to use badly documented code is a lot of work).

In astronomy, examples include the Hubble Space Telescope (virtually all data is public) or Kepler (the full data sets used to find planets, and tools to analyze them, are public.) Upcoming missions like LSST (full scans of the sky every few days) and Gaia (geometric distances to ~1,000,000,000 targets, with lots of other data) will flood scientists with data – similar to what people are dealing with now in climate science, or the biological sciences.

So in this context the art of being able to find patterns in enormous data sets is becoming almost an independent enterprise; there is something similar in what people do to try and make sense of the complex simulations that it’s now possible to perform on modern computers.

14

PJW 07.19.13 at 2:18 am

I recall there is also some interesting work taking place in the analysis of literature by drilling down into Big Data.

15

PJW 07.19.13 at 2:30 am

Interesting article, but the photo of the drill has stuck with me since I first read this piece last summer. Apologies for the double post.

http://www.wired.com/wiredenterprise/2012/08/googles-mind-blowing-big-data-tool-grows-open-source-twin/

16

floopmeister 07.19.13 at 4:59 am

“After reading the article, I still don’t know what a data scientist is. Is it just someone who takes data sets, runs them through some kind of analytic filter and then publishes the results in non peer-reviewed articles? That’s a pretty broad definition.”

Yes, but after claiming to have run an exhaustive analysis of LinkedIn I can state that this is what most of the people doing that sort of work call themselves so, naturally, this is what they are called.

Journalists interested in publishing this definitive conclusion please leave contact details in this comment thread.

17

Mao Cheng Ji 07.19.13 at 7:13 am

A person, like myself, who works with large databases, can learn a lot about large databases. Making claims about the outside world is problematic. If you choose (or are forced) to make these claims, manipulating and misrepresenting is real easy.

18

Karri 07.19.13 at 2:07 pm

This data science thing really seems to be mostly about branding “analyst” to sound sexier. That said, I’m fairly happy about it, since the companies looking for data scientists seem to be willing to hire actual ex-scientists from academia to work on pretty interesting stuff. Please someone hire me! I think I can even dimly recall my undergrad stats.

19

RSA 07.19.13 at 4:33 pm

“The wonderful thing about being a data scientist is that I get all of the credibility of genuine science, with none of the irritating peer review or reproducibility worries.”

For me, a few messages come across in this passage. Not everything that data scientists do is actually science. Journalists and lay people are not the best judges of what constitutes science. Some data scientists may not have a clear idea (possibly for good reason) about how their work fits into larger scientific endeavors—if you take out objective methods, verifiability, reproducibility, hypotheses, and models, there’s not much left.

20

JP Stormcrow 07.20.13 at 11:28 am

Big Data: The persistent belief that any sufficiently large pile of horseshit contains a pony.

21

Tony Lynch 07.20.13 at 1:02 pm

Which pony is then an enemy combatant.

Comments on this entry are closed.