# The Limits of Data and Correlation, from Phil McDonnell

February 22, 2012

When we analyze data and find some sort of correlation, either positive or negative, what have we really found? Have we found cause and effect?

The simple answer is no. Establishing correlation does not demonstrate causation. The fallacy at the core of this is the assumption that when two variables are correlated, one must cause the other. The real underlying cause could be a third, unobserved variable that is moving both of the observed variables.
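A small simulation can make the hidden-variable point concrete. The sketch below is only an illustration under made-up assumptions: the series `z`, `x`, and `y`, the noise levels, and the helper names are all hypothetical and do not model any real market. Two series that never influence each other end up correlated because both load on the same unobserved driver, and the correlation disappears once that driver is regressed out.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, zs):
    """Residuals of ys after a least-squares fit on zs (with intercept)."""
    n = len(ys)
    my, mz = sum(ys) / n, sum(zs) / n
    beta = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
            / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - beta * (z - mz) for y, z in zip(ys, zs)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical hidden driver; x and y each depend on z but not on each other.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 1) for zi in z]
y = [zi + rng.gauss(0, 1) for zi in z]

raw_corr = pearson(x, y)  # theoretical value is 0.5 for this setup
partial_corr = pearson(residuals(x, z), residuals(y, z))  # theoretical value is 0

print(f"correlation of x and y:            {raw_corr:.3f}")
print(f"correlation after controlling z:   {partial_corr:.3f}")
```

Of course, the control step is only possible here because `z` is known to the simulation. In practice the whole difficulty is that the third variable may be one we never thought to measure.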

An example of this might be that we observe the stock market and the bond market moving together over a period of time. That does not mean that one is causing the other. In reality, both may be driven by the Fed's Permanent Open Market Operations (POMO). If that is a variable we have not considered, then we will be oblivious if it is removed from the economic landscape one day.

All of this raises the question of whether we should be trading on past correlations at all. Is it just a fool's errand? I think it is not, especially if the correlation is strong enough. But it does expose us to the risk that the hidden real cause will evaporate someday without our being aware of it. That is the risk of speculation. We must be ready to give up a system or anomaly that has worked in the past if it suddenly stops working for us.
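One simple way to watch for a relationship that has stopped working is a rolling-window correlation. The following is a minimal sketch on simulated data, not a trading rule: the window length, the noise levels, and the point at which the hypothetical common driver is switched off are all arbitrary assumptions chosen for illustration.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def rolling_corr(xs, ys, window):
    """Correlation over each trailing window of the given length."""
    return [pearson(xs[i - window:i], ys[i - window:i])
            for i in range(window, len(xs) + 1)]

rng = random.Random(1)  # fixed seed so the run is repeatable
n, window = 1000, 200

# Simulated series: a hidden common driver is present for the first
# half of the sample and then vanishes (assumption for illustration).
x, y = [], []
for t in range(n):
    common = rng.gauss(0, 1) if t < n // 2 else 0.0
    x.append(common + rng.gauss(0, 0.5))
    y.append(common + rng.gauss(0, 0.5))

rc = rolling_corr(x, y, window)
early_corr = rc[0]   # window lies entirely in the regime with the driver
late_corr = rc[-1]   # window lies entirely after the driver is gone

print(f"early rolling correlation: {early_corr:.2f}")
print(f"late rolling correlation:  {late_corr:.2f}")
```

The rolling estimate only reveals the break after enough post-break observations have entered the window, which is the practical cost of the hidden cause evaporating without our being aware of it.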

## Yishen Kuik writes:

I am far from qualified to speak with any authority on statistics, and my training in mathematics was only as an undergraduate focusing on number theory.

My only claim as to why my opinion on this matters is that I have been operating a statistical trading book for some years and have not yet been swallowed up by the market.

I find that I can get most of the answers I need with fairly basic statistical tools, as long as I ask the right questions with them. I have also found that most advanced tools have to be used with care. I want to be able to rely on the results I get from tests, and advanced tools tend to have specifications and nuances that I find troublesome to learn well enough to use the tool with confidence.

I am surprised at how confident many people, especially those in academia, are in the results they get from using very involved statistical techniques. Even when using very simple tools, I find that I have to think very carefully about the range of explanations for results and how vulnerable they are to various quirky aspects of the data. The Chair's points about how fat tails can be the result of aggregated Gaussians, or how the arc sine law can lead to unexpected distributions of highs and lows, are good examples of this. In practical usage, I find that such unexpected results are quite commonplace. With complex tools, I am concerned that I may be blindsided by unexpected results from the interaction of data attributes with the details of the implementation, which impairs my ability to interpret the results correctly. The non-stationary nature of financial time series, the single history, the memory, the regime-based volatility and many other aspects of markets tend to really screw up many statistical tools. It is too hard for me to look through the details of the advanced tools and think about how the perversity of financial time series might affect their results before I can even contemplate using them with any confidence.
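Both of the examples mentioned above can be reproduced with a few lines of simulation. This is a rough sketch under assumptions of my own choosing. It reads "aggregated Gaussians" as a 50/50 mix of a calm and a volatile Gaussian regime, which is one possible interpretation and not necessarily the one originally intended. It also uses a driftless random walk with Gaussian steps for the arc sine effect. The volatilities, sample sizes, and function names are hypothetical.

```python
import random

def excess_kurtosis(xs):
    """Sample excess kurtosis; close to 0 for Gaussian data."""
    n = len(xs)
    m = sum(xs) / n
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0

def argmax_fraction(steps, rng):
    """Fraction of the way through a random walk at which its high occurs."""
    pos, best, best_t = 0.0, 0.0, 0
    for t in range(1, steps + 1):
        pos += rng.gauss(0, 1)
        if pos > best:
            best, best_t = pos, t
    return best_t / steps

rng = random.Random(42)  # fixed seed so the run is repeatable

# 1) Fat tails from mixing Gaussians with different volatilities.
#    Each draw is Gaussian, but sigma is 1 or 3 with equal probability.
#    Theoretical excess kurtosis of this mix is 3 * 41 / 25 - 3 = 1.92.
n = 20000
calm = [rng.gauss(0, 1) for _ in range(n)]
mixed = [rng.gauss(0, 1 if rng.random() < 0.5 else 3) for _ in range(n)]
calm_kurt = excess_kurtosis(calm)
mixed_kurt = excess_kurtosis(mixed)

# 2) Arc sine effect: the high of a driftless walk clusters near the
#    start or the end of the period rather than spreading out evenly.
#    A uniform spread would put about 20% of highs in the outer tenths;
#    the arc sine law gives about 41% in the continuous limit.
walks, steps = 2000, 200
fracs = [argmax_fraction(steps, rng) for _ in range(walks)]
near_ends = sum(1 for f in fracs if f <= 0.1 or f >= 0.9) / walks

print(f"excess kurtosis, single Gaussian: {calm_kurt:.2f}")
print(f"excess kurtosis, mixed regimes:   {mixed_kurt:.2f}")
print(f"share of highs in first or last 10% of the walk: {near_ends:.2f}")
```

Neither result involves anything exotic in the inputs, which is the point: a test that assumes one Gaussian regime or evenly spread highs can be misled by perfectly ordinary data.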

I find that to get the right answers, it is more important to sit down and think and come up with the right list of questions to ask, the answers to which in total should reveal the bigger answer you want to find. For causality and correlation, I doubt if there is a "just add numbers" tool that will give you a worthwhile result.

My algorithm for answering such a question would be to draw a warm bath and sit in it for a while. Then in about 2 or 3 days, usually in the early morning for me, a list of questions will come to me, the combination of answers to which will address the correlation/causation issue, and then later at my office I can construct the tests necessary to express those questions in a few hours.
