I thought I’d repost this piece from my old blog, because a multidisciplinary audience is just what it needs. The starting point is as follows:
‘Data mining’ is an interesting term. It’s used very positively in some academic circles, such as departments of marketing, and very negatively in others, most notably departments of economics. In its positive sense, the term refers to the use of clever automated search techniques to discover putatively significant relationships in large data sets. For economists, however, the term is invariably used with the implication that the relationships discovered are spurious, or at least that the procedure yields no warrant for believing them to be real. The classic article is Lovell, M. (1983), ‘Data Mining’, Review of Economics and Statistics 65(1), 1–12, which long predates the rise to popularity of data mining in many other fields.
So my first question is whether the economists are isolated on this, as on so much else. My second question is how such a situation can persist without any apparent awareness or concern on either side of the divide.
The paradigm example of data mining, though a very old-fashioned one, is ‘stepwise regression’. You take a variable of interest and set up a multivariate regression. The computer then tries out all the other variables in the data set one at a time: if a variable comes up significant it stays in, otherwise it’s dropped. In the end you have what is, arguably, the best possible regression.
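To make the procedure concrete, here is a minimal sketch of forward stepwise selection in Python; the statsmodels calls are real, but the data frame, column names and 5 per cent threshold are illustrative assumptions rather than anything from the original post.

```python
# A rough sketch of forward stepwise selection: greedily add the candidate
# regressor with the smallest p-value, stop when nothing left is 'significant'.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(df, target, threshold=0.05):
    selected = []
    remaining = [c for c in df.columns if c != target]
    while remaining:
        pvals = {}
        for cand in remaining:
            X = sm.add_constant(df[selected + [cand]])
            fit = sm.OLS(df[target], X).fit()
            pvals[cand] = fit.pvalues[cand]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= threshold:
            break                      # nothing left 'comes up significant'
        selected.append(best)
        remaining.remove(best)
    return selected

# Run it on pure noise: anything it selects is, by construction, spurious.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 11)),
                  columns=["y"] + [f"x{i}" for i in range(10)])
print(forward_stepwise(df, "y"))
```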
Economists were early and enthusiastic users of stepwise regression, but they rapidly became disillusioned. To see the problem, consider the simpler case of testing correlations. Suppose that, in a given dataset, you find that consumption of restaurant meals is positively correlated with education. This correlation might have arisen by chance or it might reflect a real causal relationship of some kind (not necessarily a direct or obvious one). The standard statistical test involves determining how likely it is that you would have seen the observed correlation if there were in fact no relationship. If this probability is lower than, say, 5 per cent, you say that the relationship is statistically significant.
Now suppose you have a data set with 10 variables. That makes 45 (=10*9/2) distinct pairs you can test. Just by chance, you’d expect two or three of these correlations to appear statistically significant. So if your only goal is to find a significant relationship that you can turn into a publication, this strategy works wonders.
But perhaps you have views about the ‘right’ sign of the correlation, perhaps based on some economic theory or political viewpoint. On average, half of all random correlations will have the ‘wrong’ sign, but you can still expect to find at least one ‘right-signed’ and statistically significant correlation in a set of 10 variables. So, if data mining is extensive enough, the usual statistical checks on spurious results become worthless.
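A quick simulation makes the arithmetic vivid. This is only a sketch, assuming 10 mutually independent variables and a 5 per cent significance level; all names and numbers are invented for the example.

```python
# Ten independent variables, 45 pairwise correlations: count how many come
# out 'statistically significant' purely by chance.
from itertools import combinations
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
data = rng.normal(size=(100, 10))          # no real relationships at all

hits = []
for i, j in combinations(range(10), 2):    # the 45 distinct pairs
    r, p = pearsonr(data[:, i], data[:, j])
    if p < 0.05:
        hits.append(r)

print(len(hits), "of 45 pairs significant at the 5% level")
print(sum(r > 0 for r in hits), "of those have a positive ('right') sign")
```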
In principle, there is a simple solution to this problem, reflecting Popper’s distinction between the context of discovery and the context of justification. There’s nothing wrong with using data mining as a method of discovery, to suggest testable hypotheses. Once you have a testable hypothesis, you can discard the data set you started with and test the hypothesis on new data untainted by the process of ‘pretesting’ that you applied to the original data set.
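In code, the discipline amounts to a simple split: mine one half of the data freely, then test the single surviving hypothesis once on the held-out half. A minimal sketch, with the split fraction and variable names assumed for illustration:

```python
# Explore on one half (context of discovery), confirm on the other
# (context of justification), so the final p-value is untainted by pretesting.
from itertools import combinations
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(400, 10)),
                  columns=[f"x{i}" for i in range(10)])

explore = df.sample(frac=0.5, random_state=0)
confirm = df.drop(explore.index)

# 'Mine' the exploration half: pick the pair with the smallest p-value.
best = min(combinations(df.columns, 2),
           key=lambda pair: pearsonr(explore[pair[0]], explore[pair[1]])[1])

# One pre-specified test on fresh data.
r, p = pearsonr(confirm[best[0]], confirm[best[1]])
print(best, round(r, 3), round(p, 3))
```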
Unfortunately, at least for economists, it’s not that simple. Data is scarce and expensive. Moreover, no-one gets their specification right first time, as the simple testing model would require. Inevitably, therefore, there has to be some exploration (mining) of the data before hypotheses are tested. As a result, statistical tests of significance never mean precisely what they are supposed to.
In practice, there’s not much that can be done except to rely on the honesty of investigators in reporting the procedures they went through before settling on the model they estimate. If the results are interesting enough, someone will find another data set to check or will wait for new data to allow ‘out of sample’ testing. Some models survive this stringent testing, but many do not.
I don’t know how the users of data mining solve this problem. Perhaps their budgets are so large that they can discard used data sets like disposable syringes, never infecting their analysis with the virus of pretesting. Or perhaps they don’t know or don’t care.
{ 22 comments }
dsquared 02.13.04 at 10:38 am
Basically your intuition is right; data mining in marketing is what you do when you’ve got a mountain of data that you’ve collected because it “seemed like a sensible thing to do” and want to know what to do with it. Like Amazon.com’s sales data. Also note that marketing people typically use nonlinear fitting techniques rather than stepwise regression these days, because they’re not as hung up as economists on t-ratios greater than 1.96 (they are, correctly, more concerned with practical significance than statistical significance).
I like the phrase “data dredging” to describe the pejorative sort, because a miner occasionally strikes gold but a dredger just stirs up sludge.
Scott Martens 02.13.04 at 10:57 am
In my field, data mining is more or less synonymous with corpus linguistics, data-driven NLP and anti-nativism. A lot of those techniques are taken seriously in discussions about the learnability of natural language.
I think the problem is that in economics, it has come to mean the discovery of non-obvious relationships in data that may or may not be spurious, but which are used to support or undermine theories. In linguistics and AI, it’s a lot more about showing that something can or can’t be learned in a robust way. The theories are about what can be accomplished with data mining, not about the results themselves.
NB: What I do for a living is apply data mining techniques to corpora of texts in order to extract commercially useful information about usages. So, I may be biased.
There are people in linguistics who are very critical of data-driven learning, but I don’t remember ever seeing one connect their criticisms to criticisms of statistical analysis in economics. It strikes me as a good rhetorical approach to attacking the new paradigm in linguistics, so I’m surprised that I haven’t seen someone try it. The idea never occurred to me until this moment, so maybe it’s never occurred to anyone else.
Kieran Healy 02.13.04 at 11:40 am
There are respectable data mining competitions where the competitors are given half a huge data set and told to develop a model on it. Then the models are tested on the other half.
Also, as Daniel said, nonparametric methods are now very well established, and increasingly common in the social sciences… I think the data-mining side of things complements the development of model validation methods of various sorts (like jackknifing, extreme-bounds analysis, resampling methods and so on) that help you check the value of your model. They’re both consequences of the computational revolution in statistical modeling that’s taken place in the past 20-odd years.
Chris Lightfoot 02.13.04 at 12:10 pm
Others may be interested in reading this piece: “Why Stepwise Regression is Dumb”.
Kieran Healy 02.13.04 at 12:20 pm
Frank Harrell, linked to in Chris Lightfoot’s comment above, is the author of the excellent Regression Modeling Strategies, which is one of the books I try to live up to when doing my own data analysis.
dsquared 02.13.04 at 12:31 pm
I’d also give a qualified defence of stepwise regression in the right hands; as practiced by David Hendry and gang, it’s not always bad. In particular, if you have a more sophisticated way of looking at the significance of sequential improvements in fit (encompassing tests), you can get worthwhile results.
Here’s Doornik’s qualified defence of data mining as applied in PcGets: http://www.pcgive.com/pcgets/index.html?content=/pcgets/gtscrits.html
dsquared 02.13.04 at 12:34 pm
working link
Barry 02.13.04 at 2:05 pm
Note: in statistics and biostatistics, ‘data mining’ is generally considered to be a bad thing. And replacing stepwise regression with any other statistical procedure won’t solve the problem, since it’s a problem of multiple testing, not just of correlations.
From what I heard, neural network researchers had to learn about this, even while using complex nonlinear models. They also had to learn about what happens when your model has as many parameters as data points.
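For anyone who hasn’t run into it, the ‘as many parameters as data points’ failure is easy to reproduce. A small sketch; the degree-9 polynomial and the sample of ten points are arbitrary choices for illustration.

```python
# Fit 10 noisy points with a degree-9 polynomial: 10 parameters, 10 points.
# The fit is essentially perfect in sample and useless out of sample.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 1, 10)
y = 2 * x + rng.normal(scale=0.1, size=10)        # true relationship is linear

coeffs = np.polyfit(x, y, deg=9)                  # as many parameters as points
x_new = np.linspace(0.05, 0.95, 10)               # fresh points in between
y_new = 2 * x_new + rng.normal(scale=0.1, size=10)

print("in-sample max error:", np.abs(np.polyval(coeffs, x) - y).max())
print("out-of-sample max error:", np.abs(np.polyval(coeffs, x_new) - y_new).max())
```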
dsquared 02.13.04 at 2:46 pm
Barry: There is nothing wrong with multiple testing as long as you’re very careful about interpreting the results (after all, it would otherwise be utterly impossible to have a valid model of GDP, since so much work has already been done on the time series). Have a look at Doornik’s responses to criticisms of PcGets.
Chris Genovese 02.13.04 at 3:25 pm
In Statistics, the term data mining did historically refer to sifting through data for significant effects, with spurious results consequently produced by implicit selection bias. (For example, picking the most significant coefficient in a regression involves an implicit maximum over many statistics, and the null distribution of the maximum is unlike that of an arbitrarily chosen statistic.) Definitely a bad thing, though I don’t think the term itself really captures this negative sense. (Data sifting might have been better, or, as suggested above, data dredging.)
Computer scientists noticed this and reasoned that those little chunks of gold coming out of the mountain can be valuable. So they redefined the term accordingly. It’s now often taken to mean algorithmic methods for extracting information from large data sets without the necessary association with spurious effects. We even have a course on Data Mining now; many of our students have never encountered the historically negative meaning.
On stepwise regression, I agree with Daniel that it can be done well if handled carefully. But the variable selection problem has also received a great deal of research attention in Statistics over the past decade. Developments include new information criteria, prediction error estimates, the Lasso (an absolute error criterion), new multiple testing methods like False Discovery control, and Bayesian approaches. These frequently outperform traditional stepwise regression and related methods. Moreover, more attention is now going to model averaging, where one incorporates the uncertainty about the model into the inferences rather than throwing it away in choosing just one.
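To give one concrete instance, here is a rough sketch of Lasso-based variable selection using scikit-learn; the simulated data and the sparsity pattern are made up for the example.

```python
# Lasso with a cross-validated penalty: coefficients on irrelevant variables
# are shrunk exactly to zero, giving variable selection as a by-product.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]                  # only the first three variables matter
y = X @ beta + rng.normal(size=n)

model = LassoCV(cv=5).fit(X, y)
print("selected variables:", np.flatnonzero(model.coef_ != 0))
print("coefficients:", np.round(model.coef_, 2))
```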
To follow up on Daniel’s last point regarding multiple hypothesis testing, that area has also undergone a renaissance in recent years, spurred in large part by Benjamini and Hochberg’s False Discovery Rate control. There is now a much wider range of error measures that can be controlled, even for millions of simultaneous tests, eliminating much of the interpretive angst traditionally associated with this problem.
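For concreteness, a minimal sketch of the Benjamini–Hochberg step-up procedure; the simulated p-values are invented, and in practice one would reach for an existing implementation such as statsmodels’ multipletests.

```python
# Benjamini-Hochberg: sort the p-values, find the largest k with
# p_(k) <= (k/m) * q, and reject the k smallest. This controls the expected
# proportion of false discoveries at level q.
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    p = np.asarray(pvalues)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()       # largest index meeting the bound
        rejected[order[:k + 1]] = True
    return rejected

# 90 null p-values (uniform) plus 10 genuine effects with tiny p-values.
rng = np.random.default_rng(5)
pvals = np.concatenate([rng.uniform(size=90), rng.uniform(0, 0.001, size=10)])
print(benjamini_hochberg(pvals).sum(), "rejections at q = 0.05")
```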
eszter 02.13.04 at 3:32 pm
I think most sociologists look at data mining in a negative light as well. But it may depend on your training and how you feel about the use of quantitative data. I was taught – and prefer this approach – that you should usually theorize even what are often simply referred to as “the usual control variables”, so I prefer to have conceptual reasons for including variables in an analysis, and that’s what I’m teaching my students as well.
Somewhat related is a point that came up in class yesterday when we were discussing a piece by an economist. One of the students noted that the author had not controlled for the usual variables when looking at how computer and Internet use at work may be influencing hours worked. I noted to the student that the author had, in fact, controlled for many, many factors but did not include these in the table. Rather, they are listed in the appendix (not in a table with coefficients, just as a list of controls included in the analyses). I don’t ever recall seeing that done by a sociologist. (Granted, this is a working paper; I don’t know how often this is done in econ in a final publication.) In case anyone’s curious, the piece we were discussing was Richard Freeman’s NBER working paper on “The Labour Market in the New Information Economy”.
Erik 02.13.04 at 4:14 pm
I find the Bayesian model averaging approach a very appealing solution to this. It is based on the realization that the reason we think data mining is useful is that we have a priori uncertainty about what the ‘correct’ model is. This uncertainty, however, should not be ignored in our inferences.
Bill 02.13.04 at 4:33 pm
Epidemiologists and biostatisticians (e.g., as Kieran notes, Frank Harrell) worry about data mining in the pejorative sense for the same reasons that economists do. However, we are also trying to see what can be learned from the marketing techniques (data mining in the positive sense): we want to find problems in clinical practice that we aren’t currently aware of by spotting data anomalies (artificial example: “how come there are so few imaging charges associated with appendectomies in the last six months? can they possibly be opening folks up without looking first???”).
Aaron Bergman 02.13.04 at 4:41 pm
As another data point, data mining is used pejoratively in physics.
Steve 02.13.04 at 5:38 pm
Another thing you failed to mention is that pretest estimators are equal to neither the OLS nor the RLS (Restricted Least Squares) estimators. This makes it hard to evaluate pretest estimators.
The sampling distributions are stochastic mixtures of the unrestricted estimator and the restricted estimator. Pretest estimators are also discontinuous functions of the data, and a small change in the data can bring a switch in the results.
Also many economists use pretest estimators.
dsquared 02.13.04 at 6:33 pm
Cosma Shalizi knows a hell of a lot about this general area, though he may disagree with me over the spelling of his name.
rvman 02.13.04 at 7:59 pm
Economists reject data mining based on methodology. The purpose of econometric analysis is hypothesis testing. Those hypotheses must come from theory. Without theory, there is no hypothesis to test. I suppose you could use mining to find hypotheses to test, but you would then have to “start over”, developing a theory which calves off that particular hypothesis, and then testing that hypothesis on yet another data set – the first is contaminated. (The regression statistics wouldn’t mean what you would be claiming they meant, if you used the first data set.)
There isn’t much concern on either side of the divide because the marketeers view the economists’ concerns as foolish, and economists view most business research as garbage. (Hence the low esteem generally given to economics departments in business, rather than academic, colleges.)
Jacob T. Levy 02.13.04 at 8:11 pm
In both poli sci and demography, the situation is more or less as John describes it. Everyone knows data mining is wrong. Everyone understands, in principle, that if you do it to build your model you ought to then throw the data set out and start again. No one really does this except for graduate students fresh from their methodology class and burning with the desire to Do Things Right. People use stepwise regression with a bit of bad conscience (though I gather that some people are enthusiastic believers in it), but they use it. They try to develop some kind of theory before starting to mine, and try to be honest with themselves and their readers about not coming up with an entirely post-hoc theory to justify the associations that just happen to be significant. They try to avoid the worst excesses of just throwing everything into the model and seeing what comes out, but they don’t really design a model entirely in isolation from the data set they’re going to test it on.
Omri 02.13.04 at 8:41 pm
There was once a Dilbert cartoon about data mining.
John Quiggin 02.13.04 at 9:15 pm
Eszter’s point about the ‘usual control variables’ is correct. It’s increasingly common to report only the variables of interest. I don’t have a big problem with this, particularly if the desirable practice of making the full dataset and detailed results available on a website is followed.
My big concern is with the use of instrumental variables, which are proxies for some variable of interest constructed from other supposedly exogenous variables in a way that is supposed to avoid bias due to simultaneous causation. That’s all well and good, but the fact is still that the regression relates the dependent variable to the exogenous variables. Yet the procedure used to construct the instrument from those variables is often described poorly and sometimes not at all.
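For readers outside econometrics, here is a bare-bones sketch of what a two-stage least squares (instrumental variables) estimate involves; the data-generating process and variable names are invented for illustration, and standard errors are deliberately ignored.

```python
# Two-stage least squares by hand: regress the endogenous regressor on the
# instrument, then use the fitted values in the second-stage regression.
import numpy as np

rng = np.random.default_rng(6)
n = 1000
z = rng.normal(size=n)                        # instrument, exogenous by assumption
u = rng.normal(size=n)                        # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n)          # endogenous regressor
y = 1.5 * x + 2.0 * u + rng.normal(size=n)    # true coefficient on x is 1.5

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
print("OLS slope (biased):", ols(np.column_stack([ones, x]), y)[1])

# Stage 1: project x onto the instrument. Stage 2: regress y on the projection.
x_hat = np.column_stack([ones, z]) @ ols(np.column_stack([ones, z]), x)
print("2SLS slope:", ols(np.column_stack([ones, x_hat]), y)[1])
```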
bill carone 02.13.04 at 9:31 pm
Perhaps a Bayesian view would help.
In Bayesian analysis, you always test one hypothesis against others (real ones, not imaginary null and alternative ones :-). For example, if you were fitting a line to data on a 2D graph, you might start with the idea that all slopes and all intercepts were equally likely before you saw the data, then do calculations to see how observing the data changed those probabilities. So, after seeing the data, you would know how likely each possible line was.
For example, take the following model. This model says that there are ten possible slopes (m=1,2,3, … 10) and the intercept of the line equals zero. So there are ten possible lines we are considering.
Each possible line starts with 10% probability (called the prior probabilities, since they are the probabilities you have prior to getting the data). Then, when we see the data, that probability changes (it is multiplied by a factor that depends on how close the data fit each possible line).
After seeing the data, we would then have a new set of 10 probabilities, one for each line, describing how likely each line is now.
Now, let’s do some data mining; let’s look at another model, using the same data. This model, instead of assuming the intercept equals zero, gives it ten possible values (0, 1, 2, …, 9). Now, instead of ten possible lines, we have 100.
This has two effects:
1. It (most likely) makes the best line fit better; after all, instead of just 10 lines to choose from, you have 100.
2. It lowers the prior probabilities; now each line starts with 1% probability rather than 10%.
So, the second model, with two parameters, may give worse results than the first, with only one. For example, if you could only fit the data, say, two times better (in some sense), then it doesn’t outweigh the tenfold loss of prior probability.
So, in comparing the two models, the first one would be better; even though the “best fit” line of the second model is better than the first, the second model ends up with a lower probability. The calculations effectively took Occam’s razor to it, and cut it down to size.
So, Bayesian analysis automatically makes the trade-off between fitting the data and introducing new parameters into a model.
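A rough numerical version of this comparison, discretised exactly as in the example above (ten slopes with a zero intercept versus ten slopes crossed with ten intercepts), assuming Gaussian noise with a known standard deviation for simplicity:

```python
# Compare the two models by marginal likelihood: average the likelihood over
# each model's equally weighted candidate lines. Model 2 spreads its prior
# over 100 lines instead of 10, which is the automatic Occam penalty.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 20)
y = 3.0 * x + rng.normal(scale=1.0, size=x.size)    # true line: slope 3, intercept 0

def likelihood(slope, intercept):
    return np.prod(norm.pdf(y - (slope * x + intercept), scale=1.0))

# Model 1: slopes 1..10, intercept fixed at 0 (prior 1/10 per line).
m1 = np.mean([likelihood(m, 0) for m in range(1, 11)])
# Model 2: slopes 1..10 crossed with intercepts 0..9 (prior 1/100 per line).
m2 = np.mean([likelihood(m, b) for m in range(1, 11) for b in range(10)])

print("marginal likelihood, model 1:", m1)
print("marginal likelihood, model 2:", m2)
print("Bayes factor in favour of model 1:", m1 / m2)
```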
My understanding of stepwise regression is that this doesn’t happen; am I correct? In other words, every time you put a new variable in, the p-value gets better (take one out, it gets worse).
Does this solve the problem of data mining you are having? I’m not sure it does.
Let’s say you test 1000 models on the same set of data. In classical terms, even if there is no “connection” whatever, about 5% will be statistically significant at the 5% level (hey, 50 publishable papers, I’m on my way to tenure :-). This is the original problem with data mining, correct?
What happens in Bayesian terms? You end up with a probability for each model being the correct one.
This assumes that one and only one is correct. The “only one” part is fine, I’m not so sure about the “one” part. In other words, you are assuming that, among the possible hypotheses you have, one is the correct one. Any problems with that? Perhaps not; maybe the outcome of a statistical analysis should be something like “Of the hypotheses we considered, this one is (say) 20 times more likely than all the others combined.”
Then, you assign appropriate prior probabilities to each model. This addresses the “sure-thing hypothesis” problem with data mining. Once you see the data, you could come up with a model that predicts “That data is the only possible thing that could have happened.” This is clearly the winner in terms of fitting the data, but the prior probability of such a hypothesis is too small to consider.
Then, for each model, the analysis above will take into account (a) how well the model fits the data and (b) how many parameters it took to do so. This isn’t a vague, subjective trade-off; it is built into the calculations themselves, as my example above suggested.
So, what are the difficulties with this Bayesian data mining? Real question, BTW; I’m no expert. One problem is that most statistics courses are too “cookbook” and like to use techniques that don’t require much actual understanding of what is going on. For similar reasons, such a method won’t be included in a simple statistical package. Another problem might be calculational; if the above analysis required 1000 nested integrals, then we would have to turn to other methods (MCMC, perhaps). Others?
toni wuersch 02.15.04 at 2:10 am
Alas, the long and short about data mining is that it’s another subject that only people with enough training can decide is good or bad in a given instance. Either it casually alludes to a methodology some “we” all know, or it’s the prelude to a lovefest or a witchhunt.
I usually change the question and ask what is the analytic goal — to summarize, to predict, or to find and pick low-hanging fruit?
The last thing I would ever want is to have a project I’m working on shut down, because a party had some notion about data mining.