February 13, 2004

Data mining

Posted by John Quiggin

I thought I'd repost this piece from my old blog, because a multidisciplinary audience is just what it needs. The starting point is as follows:

'Data mining' is an interesting term. It's used very positively in some academic circles, such as departments of marketing, and very negatively in others, most notably departments of economics. The term refers to the use of clever automated search techniques to discover putatively significant relationships in large data sets, and in most fields it is used in a positive sense. For economists, however, the term invariably carries the implication that the relationships discovered are spurious, or at least that the procedure yields no warrant for believing they are real. The classic article is Lovell, M. (1983), ‘Data mining’, Review of Economics and Statistics 65(1), 1–12, which long predates the rise to popularity of data mining in many other fields.

So my first question is: are the economists isolated on this, as on so much else? My second question is: how can such a situation persist without any apparent awareness or concern on either side of the divide?


The paradigm example of data mining, though a very old-fashioned one, is 'stepwise regression'. You take a variable of interest, then set up a multivariate regression. The computer tries out all the other variables in the data set one at a time: if a variable comes up significant, it stays in; otherwise it's dropped. In the end you have what is, arguably, the best possible regression.
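For concreteness, here's a rough sketch of what the forward version of the procedure looks like in Python, using statsmodels; the DataFrame, target column and 5 per cent threshold are just illustrative assumptions, not anything specific to the packages economists actually used.

```python
# A rough sketch of forward stepwise selection as described above.
# `data` is an assumed pandas DataFrame containing the target column;
# the 5% threshold is the conventional choice.
import pandas as pd
import statsmodels.api as sm

def forward_stepwise(data: pd.DataFrame, target: str, alpha: float = 0.05):
    remaining = [c for c in data.columns if c != target]
    selected = []
    y = data[target]
    while remaining:
        # p-value each remaining variable would get if added to the model
        pvals = {}
        for var in remaining:
            X = sm.add_constant(data[selected + [var]])
            pvals[var] = sm.OLS(y, X).fit().pvalues[var]
        best = min(pvals, key=pvals.get)
        if pvals[best] < alpha:      # 'significant', so it stays in
            selected.append(best)
            remaining.remove(best)
        else:                        # nothing left clears the bar; stop
            break
    return selected                  # the 'best possible' set of regressors
```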

Economists were early and enthusiastic users of stepwise regression, but they rapidly became disillusioned. To see the problem, consider the simpler case of testing correlations. Suppose that, in a given data set, you find that consumption of restaurant meals is positively correlated with education. This correlation might have arisen by chance or it might reflect a real causal relationship of some kind (not necessarily a direct or obvious one). The standard statistical test involves determining how likely it is that you would have seen the observed correlation if there were in fact no relationship. If this probability is lower than, say, 5 per cent, you say that the relationship is statistically significant.

Now suppose you have a data set with 10 variables. That makes 45 (=10*9/2) distinct pairs you can test. Just by chance, you'd expect two or three of those correlations to appear statistically significant. So if your only goal is to find a significant relationship that you can turn into a publication, this strategy works wonders.
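The arithmetic is easy to check by simulation. A minimal sketch (the sample size and seed are arbitrary; the counts will vary from run to run):

```python
# Simulate 45 pairwise correlation tests among 10 unrelated variables
# and count how many come up 'significant' at the 5% level.
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))           # 10 variables, no real relationships

false_hits = 0
for i, j in combinations(range(10), 2):  # all 45 distinct pairs
    r, p = stats.pearsonr(X[:, i], X[:, j])
    if p < 0.05:
        false_hits += 1

print(false_hits, "of 45 correlations 'significant' by chance alone")
# Expect roughly 0.05 * 45, i.e. two or three, on a typical run.
```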

But perhaps you have views about the 'right' sign of the correlation, perhaps based on some economic theory or political viewpoint. On average, half of all random correlations will have the 'wrong' sign, but you can still expect to find at least one 'right-signed' and statistically significant correlation in a set of 10 variables. So, if data mining is extensive enough, the usual statistical checks on spurious results become worthless.

In principle, there is a simple solution to this problem, reflecting Popper's distinction between the context of discovery and the context of justification. There's nothing wrong with using data mining as a method of discovery, to suggest testable hypotheses. Once you have a testable hypothesis, you can discard the data set you started with and test the hypothesis on new data untainted by the process of 'pretesting' that you applied to the original data set.
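In code, this discipline amounts to a split-sample design, something like the following sketch (made-up data; the 50/50 split and 5 per cent threshold are conventional choices, not anything mandated by the argument):

```python
# Split-sample discipline: mine one half of the data for candidate
# relationships, then test only the survivors on the untouched half.
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
explore, confirm = X[:200], X[200:]      # discovery half / justification half

# Context of discovery: mine freely.
candidates = [(i, j) for i, j in combinations(range(10), 2)
              if stats.pearsonr(explore[:, i], explore[:, j])[1] < 0.05]

# Context of justification: test only the pre-registered candidates.
confirmed = [(i, j) for i, j in candidates
             if stats.pearsonr(confirm[:, i], confirm[:, j])[1] < 0.05]

print(f"{len(candidates)} candidates from mining, {len(confirmed)} survive fresh data")
```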

Unfortunately, at least for economists, it's not that simple. Data is scarce and expensive. Moreover, no-one gets their specification right first time, as the simple testing model would require. Inevitably, therefore, there has to be some exploration (mining) of the data before hypotheses are tested. As a result, statistical tests of significance never mean precisely what they are supposed to.

In practice, there's not much that can be done except to rely on the honesty of investigators in reporting the procedures they went through before settling on the model they estimate. If the results are interesting enough, someone will find another data set to check or will wait for new data to allow 'out of sample' testing. Some models survive this stringent testing, but many do not.

I don't know how the users of data mining solve this problem. Perhaps their budgets are so large that they can discard used data sets like disposable syringes, never infecting their analysis with the virus of pretesting. Or perhaps they don't know or don't care.

Posted on February 13, 2004 10:24 AM UTC
Comments

Basically your intuition is right; data mining in marketing is what you do when you’ve got a mountain of data that you’ve collected because it “seemed like a sensible thing to do” and want to know what to do with it. Like Amazon.com’s sales data. Also note that marketing people typically use nonlinear fitting techniques rather than stepwise regression these days, because they’re not as hung up as economists on t-ratios greater than 1.96 (they are, correctly, more concerned with practical significance than statistical significance).

I like the phrase “data dredging” to describe the pejorative sort, because a miner occasionally strikes gold but a dredger just stirs up sludge.

Posted by dsquared · February 13, 2004 10:38 AM

In my field, data mining is more or less synonymous with corpus linguistics, data-driven NLP and anti-nativism. A lot of those techniques are taken seriously in discussions about the learnability of natural language.

I think the problem is that in economics, it has come to mean the discovery of non-obvious relationships in data that may or may not be spurious, but which are used to support or undermine theories. In linguistics and AI, it’s a lot more about showing that something can or can’t be learned in a robust way. The theories are about what can be accomplished with data mining, not about the results themselves.

NB: What I do for a living is apply data mining techniques to corpora of texts in order to extract commercially useful information about usages. So, I may be biased.

There are people in linguistics who are very critical of data-driven learning, but I don’t remember ever seeing one connect their criticisms to criticisms of statistical analysis in economics. It strikes me as a good rhetorical approach to attacking the new paradigm in linguistics, so I’m surprised that I haven’t seen someone try it. The idea never occurred to me until this moment, so maybe it’s never occurred to anyone else.

Posted by Scott Martens · February 13, 2004 10:57 AM

There are respectable data mining competitions where the competitors are given half a huge data set and told to develop a model on it. Then the models are tested on the other half.

Also, as Daniel said, nonparametric methods are now very well established, and increasingly common in the social sciences… I think the data-mining side of things complements the development of model validation methods of various sorts (like jackknifing, extreme-bounds analysis, resampling methods and so on) that help you check the value of your model. They’re both consequences of the computational revolution in statistical modeling that’s taken place in the past 20-odd years.
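As a concrete illustration of the resampling idea (not any particular one of the methods just mentioned), here is a minimal bootstrap of a regression slope on made-up data:

```python
# Nonparametric bootstrap of a regression slope: refit on resampled
# rows and look at the spread of the estimates.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)          # toy data with a true slope of 0.5

def slope(x, y):
    return np.polyfit(x, y, 1)[0]

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)      # resample rows with replacement
    boot.append(slope(x[idx], y[idx]))

lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"slope = {slope(x, y):.2f}, 95% bootstrap interval ({lo:.2f}, {hi:.2f})")
```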

Posted by Kieran Healy · February 13, 2004 11:40 AM

Others may be interested in reading this piece: “Why Stepwise Regression is Dumb”.

Posted by Chris Lightfoot · February 13, 2004 12:10 PM

Frank Harrell, linked to in Chris Lightfoot’s comment above, is the author of the excellent Regression Modeling Strategies, which is one of the books I try to live up to when doing my own data analysis.

Posted by Kieran Healy · February 13, 2004 12:20 PM

I’d also give a qualified defence of stepwise regression in the right hands; as practiced by David Hendry and gang, it’s not always bad. In particular, if you have a more sophisticated way of looking at the significance of sequential improvements in fit (encompassing tests), you can get worthwhile results.

Here’s Doornik’s qualified defence of data mining as applied in PcGets: http://www.pcgive.com/pcgets/index.html?content=/pcgets/gtscrits.html

Posted by dsquared · February 13, 2004 12:31 PM

working link

Posted by dsquared · February 13, 2004 12:34 PM

Note: in statistics and biostatistics, ‘data mining’ is generally considered to be a bad thing. And replacing stepwise regression with any other statistical procedure won’t solve the problem, since it’s a problem of multiple testing, not just of correlations.

From what I heard, neural network researchers had to learn about this, even while using complex nonlinear models. They also had to learn about what happens when your model has as many parameters as data points.

Posted by Barry · February 13, 2004 02:05 PM

Barry: There is nothing wrong with multiple testing as long as you’re very careful about interpreting the results (after all, it would otherwise be utterly impossible to have a valid model of GDP since so much work has already been done on the time series). Have a look at Doornik’s responses to criticisms of PcGets.

Posted by dsquared · February 13, 2004 02:46 PM

In Statistics, the term “data mining” did historically refer to sifting through data for significant effects, with spurious results consequently produced by implicit selection bias. (For example, picking the “most significant” coefficient in a regression involves an implicit maximum over many statistics, and the null distribution of the maximum is unlike that of an arbitrarily chosen statistic.) Definitely a bad thing, though I don’t think the term itself really captures this negative sense. (Data sifting might have been better, or as suggested above, data dredging.)
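The point about the maximum is easy to see by simulation; a rough sketch (the choice of 20 candidate coefficients, all truly zero, is arbitrary):

```python
# The largest of many null test statistics is not distributed like an
# arbitrarily chosen one: picking the 'most significant' coefficient
# implicitly takes a maximum.
import numpy as np

rng = np.random.default_rng(3)
k = 20                                    # 20 candidate coefficients, all truly zero
z = rng.normal(size=(100_000, k))         # null z-statistics
max_abs = np.abs(z).max(axis=1)

# How often does the single best-looking coefficient clear the usual 1.96 bar?
print("P(max |z| > 1.96) =", (max_abs > 1.96).mean())   # roughly 1 - 0.95**20, about 0.64
```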

Computer scientists noticed this and reasoned that those little chunks of gold coming out of the mountain can be valuable. So they redefined the term accordingly. It’s now often taken to mean algorithmic methods for extracting information from large data sets without the necessary association with spurious effects. We even have a course on Data Mining now; many of our students have never encountered the historically negative meaning.

On stepwise regression, I agree with Daniel that it can be done well if handled carefully. But the variable selection problem has also received a great deal of research attention in Statistics over the past decade. Developments include new information criteria, prediction error estimates, the Lasso (an absolute-value penalty on the coefficients), new multiple testing methods like False Discovery Rate control, and Bayesian approaches. These frequently outperform traditional stepwise regression and related methods. Moreover, more attention is now going to model averaging, where one incorporates the uncertainty about the model into the inferences rather than “throwing it away” in choosing just one.
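For a concrete sense of how the Lasso does variable selection, a short sketch using scikit-learn on simulated data (the dimensions and coefficients are made up):

```python
# Sketch of Lasso variable selection: only a few of the 50 candidate
# predictors actually matter, and the L1 penalty (tuned by
# cross-validation) drives most irrelevant coefficients to exactly zero.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(4)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]               # only the first three predictors matter
y = X @ beta + rng.normal(size=n)

fit = LassoCV(cv=5).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(fit.coef_))
```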

To follow up on Daniel’s last point regarding multiple hypothesis testing, that area has also undergone a renaissance in recent years, spurred in large part by Benjamini and Hochberg’s False Discovery Rate control. There is now a much wider range of error measures that can be controlled, even for millions of simultaneous tests, eliminating much of the interpretive angst traditionally associated with this problem.
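The Benjamini-Hochberg step-up procedure itself is only a few lines; here is a sketch written from the standard description (the example p-values are simulated):

```python
# Benjamini-Hochberg step-up procedure: control the false discovery rate
# at level q across m simultaneous tests.
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # largest k with p_(k) <= (k/m) * q; reject the k smallest p-values
    below = np.nonzero(ranked <= (np.arange(1, m + 1) / m) * q)[0]
    reject = np.zeros(m, dtype=bool)
    if below.size:
        reject[order[: below[-1] + 1]] = True
    return reject

# Example: mostly-null p-values plus a few genuine effects.
rng = np.random.default_rng(5)
pvals = np.concatenate([rng.uniform(size=95), rng.uniform(0, 1e-4, size=5)])
print("discoveries:", benjamini_hochberg(pvals).sum())
```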

Posted by Chris Genovese · February 13, 2004 03:25 PM

I think most sociologists look at data mining in a negative light as well. But it may depend on your training and how you feel about use of quantitative data. I was taught - and prefer this approach - that you should usually theorize even what are often simply referred to as “the usual control variables” so I prefer to have conceptual reasons for including variables in an analysis and that’s what I’m teaching my students as well.

Somewhat related is a point that came up in class yesterday when we were discussing a piece by an economist. One of the students noted that the author had not controlled for the usual variables when looking at how computer and Internet use at work may be influencing hours worked. I noted to the student that the author had, in fact, controlled for many, many factors but did not include these in the table. Rather, they are listed in the appendix (not in a table with coefficients, just as a list of controls included in the analyses). I don’t ever recall seeing that done by a sociologist. (Granted, this is a working paper; I don’t know how often this is done in econ in a final publication.) In case anyone’s curious, the piece we were discussing was Richard Freeman’s NBER working paper on “The Labour Market in the New Information Economy”.

Posted by eszter · February 13, 2004 03:32 PM

I find the Bayesian model averaging approach a very appealing solution to this. It is based on the realization that the reason we think data mining is useful is that we have a priori uncertainty about what the ‘correct’ model is. This uncertainty, however, should not be ignored in our inferences.

Posted by Erik · February 13, 2004 04:14 PM

Epidemiologists and biostatisticians (e.g., as Kieran notes, Frank Harrell) worry about data mining in the pejorative sense for the same reasons that economists do. However, we are also trying to see what can be learned from the marketing techniques (data mining in the positive sense): we want to find problems in clinical practice that we aren’t currently aware of by spotting data anomalies (artificial example: “how come there are so few imaging charges associated with appendectomies in the last six months? can they possibly be opening folks up without looking first???”).

Posted by Bill · February 13, 2004 04:33 PM

As another data point, data mining is used pejoratively in physics.

Posted by Aaron Bergman · February 13, 2004 04:41 PM

Another thing you failed to mention is that pretest estimators are equal to neither the OLS nor the RLS (Restricted Least Squares) estimators. This makes it hard to evaluate pretest estimators.

The sampling distributions are stochastic mixtures of those of the unrestricted and restricted estimators. Pretest estimators are also discontinuous functions of the data, so a small change in the data can bring a switch in the results.

Also many economists use pretest estimators.

Posted by Steve · February 13, 2004 05:38 PM

Cosma Shalizi knows a hell of a lot about this general area, though he may disagree with me over the spelling of his name.

Posted by dsquared · February 13, 2004 06:33 PM

Economists reject data mining based on methodology. The purpose of econometric analysis is hypothesis testing. Those hypotheses must come from theory. Without theory, there is no hypothesis to test. I suppose you could use mining to find hypotheses to test, but you would then have to “start over”, developing a theory which calves off that particular hypothesis, and then testing that hypothesis on yet another data set - the first is contaminated. (The regression statistics wouldn’t mean what you would be claiming they meant, if you used the first data set.)

There isn’t much concern on either side of the divide because the marketeers view the economists’ concerns as foolish, and economists view most business research as garbage. (Hence the low esteem generally given to economics departments in business, rather than academic, colleges.)

Posted by rvman · February 13, 2004 07:59 PM

In both poli sci and demography, the situation is more or less as John describes it. Everyone knows data mining is wrong. Everyone understands, in principle, that if you do it to build your model you ought to then throw the data set out and start again. No one really does this except for graduate students fresh from their methodology class and burning with the desire to Do Things Right. People use stepwise regression with a bit of bad conscience (though I gather that some people are enthusiastic believers in it), but they use it. They try to develop some kind of theory before starting to mine, and try to be honest with themselves and their readers rather than coming up with an entirely post-hoc theory to justify the associations that just happen to be significant. They try to avoid the worst excesses of just throwing everything into the model and seeing what comes out, but don’t really design a model entirely in isolation from the data set they’re going to test it on.

Posted by Jacob T. Levy · February 13, 2004 08:11 PM

There was once a Dilbert cartoon about data mining.

Posted by Omri · February 13, 2004 08:41 PM

Eszter’s point about the ‘usual control variables’ is correct. It’s increasingly common to report only the variables of interest. I don’t have a big problem with this, particularly if the desirable practice of making the full dataset and detailed results available on a website is followed.

My big concern is with the use of instrumental variables, which are proxies for some variable of interest constructed from other supposedly exogenous variables in a way that is supposed to avoid bias due to simultaneous causation. That’s all well and good, but the fact is still that the regression relates the dependent variable to the exogenous variables. Yet the procedure used to construct the instrument from those variables is often described poorly and sometimes not at all.
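For readers unfamiliar with the technique, here is a hand-rolled sketch of the two-stage least squares version of instrumental variables on simulated data (the coefficients and instrument strength are made up; in practice dedicated packages handle this and the associated standard errors):

```python
# Hand-rolled two-stage least squares on simulated data with simultaneity:
# x is correlated with the error term u, and z is a valid instrument.
import numpy as np

rng = np.random.default_rng(6)
n = 5000
z = rng.normal(size=n)                       # instrument: exogenous
u = rng.normal(size=n)                       # structural error
x = 0.8 * z + 0.5 * u + rng.normal(size=n)   # endogenous regressor (correlated with u)
y = 1.0 * x + u                              # true coefficient on x is 1.0

def ols(y, X):
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

print("OLS estimate (biased):    ", ols(y, x)[1])
# Stage 1: regress the endogenous x on the instrument z to get fitted values.
x_hat = np.column_stack([np.ones(n), z]) @ ols(x, z)
# Stage 2: regress y on the fitted values of x.
print("2SLS estimate (consistent):", ols(y, x_hat)[1])
```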

Posted by John Quiggin · February 13, 2004 09:15 PM

Perhaps a Bayesian view would help.

In Bayesian analysis, you always test one hypothesis against others (real ones, not imaginary null and alternative ones :-). For example, if you were fitting a line to data on a 2D graph, you might start with the idea that all slopes and all intercepts were equally likely before you saw the data, then do calculations to see how observing the data changed those probabilities. So, after seeing the data, you would know how likely each possible line was.

For example, take the following model. This model says that there are ten possible slopes (m=1,2,3, … 10) and the intercept of the line equals zero. So there are ten possible lines we are considering.

Each possible line starts with 10% probability (called the prior probabilities, since they are the probabilities you have prior to getting the data). Then, when we see the data, that probability changes (it is multiplied by a factor that depends on how close the data fit each possible line).

After seeing the data, we would then have a new set of 10 probabilities, one for each line, describing how likely each line is now.

Now, let’s do some data mining; let’s look at another model, using the same data. This model, instead of assuming the intercept equals zero, gives it ten possible values (0, 1, 2, …, 9). Now, instead of ten possible lines, we have 100.

This has two effects:
1. It (most likely) makes the best line fit better; after all, instead of just 10 lines to choose from, you have 100.
2. It lowers the prior probabilities; now each line starts with 1% probability rather than 10%.

So, the second model, with two parameters, may come off worse than the first, with only one. For example, if the extra parameter only lets you fit the data, say, twice as well (in some sense), that doesn’t outweigh the tenfold loss of prior probability.

So, in comparing the two models, the first one would be better; even though the “best fit” line of the second model is better than the first, the second model ends up with a lower probability. The calculations effectively took Occam’s razor to it, and cut it down to size.

So, Bayesian analysis automatically makes the trade-off between fitting the data and introducing new parameters into a model.
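This trade-off can be written out directly; a sketch with made-up data and a known noise level, where the evidence for each model is just the prior-weighted average of the line likelihoods:

```python
# Compare a 10-line model (intercept fixed at 0) with a 100-line model
# (10 intercepts as well); the spread of the prior acts as an automatic
# Occam penalty.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = np.linspace(0, 5, 20)
y = 3 * x + rng.normal(scale=2.0, size=x.size)     # true line: slope 3, intercept 0

def likelihood(slope, intercept, sigma=2.0):
    # probability of the observed data under one candidate line
    return np.prod(stats.norm.pdf(y - (slope * x + intercept), scale=sigma))

slopes = np.arange(1, 11)
intercepts = np.arange(0, 10)

# Model 1: 10 equally likely lines (intercept = 0); evidence = average likelihood.
evidence_1 = np.mean([likelihood(m, 0) for m in slopes])
# Model 2: 100 equally likely lines.
evidence_2 = np.mean([likelihood(m, b) for m in slopes for b in intercepts])

# With equal prior probability on the two models, posterior odds = evidence ratio.
print("posterior odds, model 1 vs model 2:", evidence_1 / evidence_2)
```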

My understanding of stepwise regression is that this doesn’t happen; am I correct? In other words, every time you put a new variable in, the p-value gets better (take one out, it gets worse).

Does this solve the problem of data mining you are having? I’m not sure it does.

Let’s say you test 1000 models on the same set of data. In classical terms, even if there is no “connection” whatever, about 5% will be statistically significant at the 5% level (hey, 50 publishable papers, I’m on my way to tenure :-). This is the original problem with data mining, correct?

What happens in Bayesian terms? You end up with a probability for each model being the correct one.

This assumes that one and only one is correct. The “only one” part is fine, I’m not so sure about the “one” part. In other words, you are assuming that, among the possible hypotheses you have, one is the correct one. Any problems with that? Perhaps not; maybe the outcome of a statistical analysis should be something like “Of the hypotheses we considered, this one is (say) 20 times more likely than all the others combined.”

Then, you assign appropriate prior probabilities to each model. This addresses the “sure-thing hypothesis” problem with data mining. Once you see the data, you could come up with a model that predicts “That data is the only possible thing that could have happened.” This is clearly the winner in terms of fitting the data, but the prior probability of such a hypothesis is too small to consider.

Then, for each model, the analysis above will take into account (a) how well the model fits the data and (b) how many parameters it took to do so. This isn’t a vague, subjective trade-off; it is built into the calculations themselves, as my example above suggested.

So, what are the difficulties with this Bayesian data mining? Real question, BTW; I’m no expert. One problem is that most statistics courses are too “cookbook” and like to use techniques that don’t require much actual understanding of what is going on. For similar reasons, such a method won’t be included in a simple statistical package. Another problem might be calculational; if the above analysis required 1000 nested integrals, then we would have to turn to other methods (MCMC, perhaps). Others?

Posted by bill carone · February 13, 2004 09:31 PM

Alas, the long and short of data mining is that it’s another subject that only people with enough training can decide is good or bad in a given instance. Either it casually alludes to a methodology some “we” all know, or it’s the prelude to a lovefest or a witch-hunt.

I usually change the question and ask what is the analytic goal — to summarize, to predict, or to find and pick low-hanging fruit?

The last thing I would ever want is to have a project I’m working on shut down, because a party had some notion about data mining.

Posted by toni wuersch · February 15, 2004 02:10 AM
Followups

→ On Data Mining.
Excerpt: Over at Crooked Timber, social scientists and economists take potshots at data mining. Over here in the Business Intelligence field, there's a lot more potential for it. As an added value part of a product suite that I am developing,...Read more at Cobb

This discussion has been closed. Thanks to everyone who contributed.