Google strikes again. With its huge computing power it is changing the way science is done, according to The End of Theory: The Data Deluge Makes the Scientific Method Obsolete, an article by Chris Anderson in Wired magazine.

Some excerpts:

"All models are wrong, but some are useful." So proclaimed statistician George Box 30 years ago. […] Speaking at the O'Reilly Emerging Technology Conference this past March, Peter Norvig, Google's research director, offered an update to George Box's maxim: "All models are wrong, and increasingly you can succeed without them."


Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise. But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete.


There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
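The trouble with "correlation is enough" is the multiple-comparisons problem: scan enough random series and some will correlate strongly with your target purely by chance. A minimal sketch (pure Python; the variable names and the counts are illustrative, not from the article):

```python
import random

random.seed(0)

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# One "target" series and thousands of candidate predictors, all pure noise.
target = [random.gauss(0, 1) for _ in range(20)]
candidates = [[random.gauss(0, 1) for _ in range(20)] for _ in range(5000)]

# A blind pattern search will always turn up a "strong" correlation here,
# even though every series is independent noise by construction.
best = max(abs(corr(c, target)) for c in candidates)
print(f"strongest correlation found in pure noise: {best:.2f}")
```

With short series and enough candidates, the best correlation found is routinely large, which is exactly why a correlation dredged from a petabyte needs a mechanism behind it before it means anything.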

Dylan Distasio replies:

I didn't see any compelling argument in that article that the scientific method is on the verge of becoming obsolete.

While cloud computing can be a great tool for analyzing protein folding or looking for extraterrestrial signals, I don't see how throwing "the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot" is a viable approach.

This sounds like the ultimate in overfitting data (although I guess there is no model!). Venter may have come up with a brilliant software tool for rapidly sequencing DNA, but it's only useful within the context of a genomic model that was built using the scientific method.

Without a framework to hang the data on, and without a testable (i.e., potentially refutable) hypothesis, I still see only noise.


William Smith relates his experiences:

I work with large amounts of data all the time. Failing to understand the mechanism, and to specify the model form based on that mechanism, leads to worthless models that fail miserably.

The noise level is very high in these types of datasets. Many "patterns" are just random, but algorithms will treat them as real. The "kitchen sink" approach of throwing many algorithms, on a cluster of processors, at a large set of data is guaranteed to find something. If all that's there is noise, they fit the noise, and when tested on new data it has never seen before, the model won't work. I have seen even the most sophisticated machine learning algorithms fail this way: support vector machines, for example.
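The failure Smith describes can be reproduced in a few lines: fit a maximally flexible model to pure noise and it is perfect in-sample and useless out-of-sample. A sketch using exact polynomial (Lagrange) interpolation as the stand-in for an over-flexible learner (stdlib only; the data and setup are illustrative, not from his work):

```python
import random

random.seed(1)

def lagrange_fit(xs, ys):
    """Return a predictor that interpolates (xs, ys) exactly --
    the extreme case of a model flexible enough to fit anything."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

# "Training" data: evenly spaced inputs, outputs that are pure noise.
train_x = [float(i) for i in range(10)]
train_y = [random.gauss(0, 1) for _ in train_x]

model = lagrange_fit(train_x, train_y)

# In-sample, the fit is essentially perfect (it memorized the noise)...
train_err = max(abs(model(x) - y) for x, y in zip(train_x, train_y))

# ...but on new points drawn the same way, the predictions fall apart.
test_x = [x + 0.5 for x in train_x[:-1]]
test_y = [random.gauss(0, 1) for _ in test_x]
test_err = max(abs(model(x) - y) for x, y in zip(test_x, test_y))

print(f"max training error: {train_err:.2e}")
print(f"max test error:     {test_err:.2f}")
```

The training error is at floating-point level while the test error is orders of magnitude larger; no amount of compute changes that when the "pattern" being fit is noise.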

I have worked with many modelers who tried the automated brute-force approach, and not one of them ever solved the problem they were working on.

I have built models in the field of chemometrics, which I patented, that use only two variables, because I understood the mechanism. It was an iterative process, and I certainly didn't get it all correct in the first cycle. In fact, I have had some outstandingly stupid ideas. But the process of testing, refining, discarding, and generating ideas is hypothesis-driven science, and that is a process hipster computer scientists cannot perform. I have never seen one of them capable of it in the last 15 years. If this is where Google is going, they are in deep kimchee and wasting a lot of money. Looking for patterns in the hay does not find the needle.




