"Theory vs. Data" in statistics too

"Theory vs. Data" in statistics too


Via Brad DeLong -- still my favorite blogger after all these years -- I stumbled on this very interesting essay from 2001, by statistician Leo Breiman. Breiman basically says that statisticians should do less modeling and more machine learning. The essay has several responses from statisticians of a more orthodox persuasion, including the great David Cox (whom every economist should know). Obviously, the world has changed a lot since 2001 -- random forests were the hot machine learning technique back then, whereas now it's deep learning -- but it seems unlikely that this overall debate has been resolved. And the parallels to the methodology debates in economics are interesting.

In empirical economics, the big debate is between two different types of model-makers. Structural modelers want to use models that come from economic theory (constrained optimization of economic agents, production functions, and all that), while reduced-form modelers just want to use simple stuff like linear regression (and rely on careful research design to make those simple models appropriate).
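
To make the two camps concrete, here's a minimal toy sketch (entirely made up, not from any paper): simulate output from a Cobb-Douglas production function, and note that a reduced-form log-linear OLS regression recovers the structural elasticity -- but only because the structural assumptions (the functional form, exogenous inputs) are true by construction.

    # Toy sketch: a "structural" Cobb-Douglas DGP, Y = A * K^alpha * L^(1-alpha) * exp(eps),
    # whose log-linear form is exactly a reduced-form regression:
    #   log Y = log A + alpha*log K + (1 - alpha)*log L + eps
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 5_000
    alpha_true = 0.3                    # structural parameter: capital elasticity
    log_K = rng.normal(3.0, 1.0, n)     # inputs assumed exogenous (a big assumption)
    log_L = rng.normal(4.0, 1.0, n)
    eps = rng.normal(0.0, 0.2, n)
    log_Y = 1.0 + alpha_true * log_K + (1 - alpha_true) * log_L + eps

    # Reduced-form OLS: regress log output on log inputs.
    X = sm.add_constant(np.column_stack([log_K, log_L]))
    fit = sm.OLS(log_Y, X).fit()
    print(fit.params)                   # coefficient on log_K should be close to 0.3

If the true technology weren't Cobb-Douglas, or the inputs responded to the error term, the same regression would still run happily -- the coefficient just wouldn't mean what the structural story says it means. That's the whole fight in a nutshell.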

I'm pretty sure I know who's right in this debate: both. If you have a really solid, reliable theory that has proven itself in lots of cases so you can be confident it's really structural instead of some made-up B.S., then you're golden. Use that. But if economists are still trying to figure out which theory applies in a certain situation (and let's face it, this is usually the case), reduced-form stuff can both A) help identify the right theory and B) help make decently good policy in the meantime.

Statisticians, on the other hand, debate whether you should actually have a model at all! The simplistic reduced-form models that structural econometricians turn up their noses at -- linear regression, logit models, etc. -- are the exact things Breiman criticizes for being too theoretical! 

Here's Breiman:
[I]n the Journal of the American Statistical Association (JASA), virtually every article contains a statement of the form: "Assume that the data are generated by the following model: ..."
I am deeply troubled by the current and past use of data models in applications, where quantitative conclusions are drawn and perhaps policy decisions made...
[Data generating process modeling] has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions...
[t]he conclusions are about the model’s mechanism, and not about nature’s mechanism. It follows that...[i]f the model is a poor emulation of nature, the conclusions may be wrong...
These truisms have often been ignored in the enthusiasm for fitting data models. A few decades ago, the commitment to data models was such that even simple precautions such as residual analysis or goodness-of-fit tests were not used. The belief in the infallibility of data models was almost religious. It is a strange phenomenon—once a model is made, then it becomes truth and the conclusions from it are [considered] infallible.
This sounds very similar to the things reduced-form econometric modelers say when they criticize their structural counterparts. For example, here's Francis Diebold (a fan of structural modeling, but paraphrasing others' criticisms):
A cynical but not-entirely-false view is that structural causal inference effectively assumes a causal mechanism, known up to a vector of parameters that can be estimated. Big assumption. And of course different structural modelers can make different assumptions and get different results.
In both cases, the criticism is that if you have a misspecified theory, results that look careful and solid will actually be wildly wrong. But the kind of simple stuff that (some) structural econometricians think doesn't make enough a priori assumptions is exactly the stuff Breiman says (often) makes way too many.

So if even OLS and logit are too theoretical and restrictive for Breiman's tastes, what does he want to do instead? Breiman wants to toss out the idea of a model entirely. Instead of making any assumption about the DGP, he wants to use an algorithm - a set of procedural steps to make predictions from data. As discussant Brad Efron puts it in his comment, Breiman wants "a black box with lots of knobs to twiddle." 

Breiman has one simple, powerful justification for preferring black boxes to formal DGP modeling: it works. He shows lots of examples where machine learning beat the pants off traditional model-based statistical techniques, in terms of predictive accuracy. Efron is skeptical, accusing Breiman of cherry-picking his examples to make machine learning methods look good. But LOL, that was back in 2001. As of 2017, machine learning - in particular, deep learning - has accomplished such magical feats that no one now questions the notion that these algorithmic techniques really do have some secret sauce. 
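
To get the flavor of the horse races Breiman has in mind, here's a toy example (synthetic data, not one of his actual case studies): fit a plain logit and a random forest to the same deliberately nonlinear classification problem and compare held-out accuracy.

    # Toy horse race: data model (logit) vs. algorithmic model (random forest)
    # on a synthetic, deliberately nonlinear problem.
    from sklearn.datasets import make_moons
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_moons(n_samples=5_000, noise=0.3, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

    logit = LogisticRegression().fit(X_tr, y_tr)   # assumes a linear decision boundary
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    print("logit accuracy: ", logit.score(X_te, y_te))
    print("forest accuracy:", forest.score(X_te, y_te))

When the linear data model is misspecified -- which it is here by construction -- the black box wins on pure prediction; when the data model happens to be right, the gap mostly closes.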

Of course, even Breiman admits that algorithms don't beat theory in all situations. In his comment, Cox points out that when the question being asked lies far out of past experience, theory becomes more crucial:
Often the prediction is under quite different conditions from the data; what is the likely progress of the incidence of the epidemic of v-CJD in the United Kingdom, what would be the effect on annual incidence of cancer in the United States of reducing by 10% the medical use of X-rays, etc.? That is, it may be desired to predict the consequences of something only indirectly addressed by the data available for analysis. As we move toward such more ambitious tasks, prediction, always hazardous, without some understanding of underlying process and linking with other sources of information, becomes more and more tentative.
And Breiman agrees:
I readily acknowledge that there are situations where a simple data model may be useful and appropriate; for instance, if the science of the mechanism producing the data is well enough known to determine the model apart from estimating parameters. There are also situations of great complexity posing important issues and questions in which there is not enough data to resolve the questions to the accuracy desired. Simple models can then be useful in giving qualitative understanding, suggesting future research areas and the kind of additional data that needs to be gathered. At times, there is not enough data on which to base predictions; but policy decisions need to be made. In this case, constructing a model using whatever data exists, combined with scientific common sense and subject-matter knowledge, is a reasonable path...I agree [with the examples Cox cites].
In a way, this compromise is similar to my post about structural vs. reduced-form models - when you have solid, reliable structural theory or you need to make predictions about situations far away from the available data, use more theory. When you don't have reliable theory and you're considering only a small change from known situations, use less theory. This seems like a general principle that can be applied in any scientific field, at any level of analysis (though it requires plenty of judgment to put into practice, obviously).

So it's cool to see other fields having the same debate, and (hopefully) coming to similar conclusions.

In fact, it's possible that another form of the "theory vs. data" debate could be happening within machine learning itself. Some types of machine learning are more interpretable, which means it's possible - though very hard - to open them up and figure out why they gave the correct answers, and maybe generalize from that. That allows you to figure out other situations where a technique can be expected to work well, or even to use insights gained from machine learning to allow the creation of good statistical models.
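
One standard way to do that kind of peeking -- a generic trick, nothing specific to Breiman or his discussants -- is permutation importance: shuffle one input at a time and see how much held-out accuracy drops. A minimal sketch:

    # Peeking inside a fitted black box: permutation importance measures how much
    # held-out accuracy falls when each feature is shuffled.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2_000, n_features=8, n_informative=3,
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    result = permutation_importance(forest, X_te, y_te, n_repeats=20, random_state=0)
    print(result.importances_mean)      # big values flag the inputs the model leans on

That doesn't hand you a theory, but it does tell you which inputs the algorithm is leaning on, which is a start toward guessing where it will and won't generalize.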

But deep learning, the technique that's blowing everything else away in a huge array of applications, tends to be the least interpretable of all - the blackest of all black boxes. Deep learning is just so damned deep - to use Efron's term, it just has so many knobs on it. Even compared to other machine learning techniques, it looks like a magic spell. I enjoyed this cartoon by Dendi Suhubdy:

[cartoon]
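
Just to put a rough number on "lots of knobs": even a tiny fully-connected network -- the sketch below uses PyTorch and an arbitrary 20-input, two-hidden-layer architecture -- carries on the order of twenty thousand trained weights, before you even start counting the hyperparameters wrapped around it.

    # How many "knobs" does even a small deep net have? (Back-of-the-envelope sketch;
    # the architecture here is arbitrary.)
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(20, 128), nn.ReLU(),   # 20 inputs -> 128 hidden units
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 2),               # 2 output classes
    )
    n_weights = sum(p.numel() for p in model.parameters())
    print(n_weights)                     # ~19,500 trained weights for this tiny net
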
Deep learning seems like the outer frontier of atheoretical, purely data-based analysis. It might even classify as a new type of scientific revolution - a whole new way for humans to understand and control their world. Deep learning might finally be the realization of the old dream of holistic science or complexity science - a way to step beyond reductionism by abandoning the need to understand what you're predicting and controlling.

But this, as they say, would lead us too far afield...

from Noahpinion http://ift.tt/2vFClMw