### Correlation does not imply causation

1/05/2014 06:57:00 PM
Statistical causation is a difficult concept that I've written about before. I'm not the only one to notice that the phrase "correlation does not imply causation" is a much overused retort to statistical claims, one that people use to defend their own priors, without any basis, in the face of contrary evidence. So let's be clear about something: mathematicians use language a bit differently--and much more precisely--than the average person. Most people, I think, believe the phrase "correlation does not imply causation" means that correlation never ever implies causation. That's a bit different from the actual mathematical meaning of the phrase, which is that correlation may sometimes imply causation, but there exists at least one counterexample in which you can have correlation but not causation. So a better way of making this point might be "not all kinds of correlation are sufficient to infer causation." But there are kinds of correlation that are sufficient to infer a high likelihood of causation, most notably the three major types of reduced form "causal inference" models: difference-in-differences, regression discontinuity, and instrumental variables. If the assumptions behind those statistical models are satisfied, we can be pretty sure that the correlations they report are causal.

But the main thing I want to point out here is not that correlation sometimes does imply causality--which it does, sometimes--but that causality is not always the thing we actually should care about. If we want to make a policy prediction, such as predicting what will happen to employment if we lower the top marginal tax rate, then we need to make darn sure our correlation implies causation. That's because our forecast is conditional: what happens to employment conditional on lowering the top marginal tax rate. However, if instead all we care about is predicting what will happen to employment next month, irrespective of what Congress does with tax rates, then we don't really need to know anything about causation at all. Because our forecast is unconditional, raw correlations contain all the information we actually want, and an understanding of the economics involved is not required.
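
To make the conditional/unconditional distinction concrete, here is a minimal simulated sketch in R. Everything in it is hypothetical: `z` is an unobserved common cause, `x` is the thing we can observe, and `y` is the thing we want to predict. By construction `x` has no causal effect on `y`, yet it still improves forecasts of `y`. Once we set `x` ourselves (the "policy" case), the relationship disappears.

```r
# Simulated illustration only; none of these variables are real data.
set.seed(1)
n <- 10000
z <- rnorm(n)        # hidden common cause
x <- z + rnorm(n)    # observed predictor, driven by z
y <- z + rnorm(n)    # outcome, driven by z but NOT by x

# Unconditional forecasting: x predicts y even though it does not cause y.
fit <- lm(y ~ x)
slope_observed <- unname(coef(fit)["x"])      # theory says about 0.5

# Policy experiment: we choose x ourselves, independently of z.
x_set <- rnorm(n)
y_after <- z + rnorm(n)  # y is generated exactly as before; x plays no role
slope_intervened <- unname(coef(lm(y_after ~ x_set))["x_set"])  # about 0

# In-sample forecast error with and without using x.
rmse_mean <- sqrt(mean((y - mean(y))^2))
rmse_x <- sqrt(mean(residuals(fit)^2))

print(c(observed = slope_observed, intervened = slope_intervened))
print(c(rmse_mean = rmse_mean, rmse_x = rmse_x))
```

The regression on observed data is perfectly good for guessing `y` from `x`, and perfectly useless for guessing what happens to `y` if we change `x`.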

So if all we want to do is predict January's employment figures, then all we need to do is find some things that are highly correlated with monthly employment--regardless of whether they "cause" employment in any way--and use that correlation to guess what the January figures will be. It turns out, actually, that monthly employment figures are highly correlated with previous months' employment figures. So all we need to do is estimate an ARMA model, which will calculate all the correlations between current and previous data. Once we have those correlations calculated, we can plug the most recent data into the equation and get a prediction. This prediction is actually the "point estimate" of our forecast--to get the complete forecast we compute what is called the "prediction interval," which consists of upper bound and lower bound estimates such that we are 95% confident that January's figures will be between these bounds.
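
As a rough sketch of those steps (estimate the model, plug in recent data, get a point estimate and a prediction interval), here is a toy version using only base R's `arima()` and `predict()`. The series is simulated, not the employment data, and I've picked a simple AR(1) by hand instead of searching over models. The interval assumes normally distributed forecast errors.

```r
# Simulated series standing in for a monthly indicator (hypothetical data).
set.seed(42)
y <- arima.sim(model = list(ar = 0.7), n = 300) + 100

# Fit an AR(1), i.e. an ARMA(1,0), with a mean term.
fit <- arima(y, order = c(1, 0, 0))

# Point forecasts and their standard errors for the next 3 periods.
fc <- predict(fit, n.ahead = 3)
point <- as.numeric(fc$pred)
se <- as.numeric(fc$se)

# Approximate 95% prediction interval, assuming normal forecast errors.
z95 <- qnorm(0.975)
lower <- point - z95 * se
upper <- point + z95 * se

print(data.frame(step = 1:3, point = point, lower = lower, upper = upper))
```

Notice that the interval widens as the horizon grows: the further ahead we forecast, the less the most recent data tell us.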

Let me actually illustrate with the employment data. Pursuant to my commitment to make science free and open-source, I've provided the data and script, and used only free platforms. If you want to follow along here you can run this code in the statistical program R (you need to have the 'forecast' library installed):
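```r
library(forecast)  # provides auto.arima() and forecast()
# PAYEMS is assumed to be loaded already (monthly FRED series from Jan 1939)
emp <- ts(PAYEMS, frequency = 12, start = c(1939, 1))
emparima <- auto.arima(emp, ic = "bic")  # pick the model by minimizing BIC
# forecast() and plot() are the generics; recent versions of the package
# no longer export forecast.Arima() and plot.forecast()
empforecast <- forecast(emparima, h = 5)
empforecast
plot(empforecast, include = 100)  # forecast plus the last 100 observations
```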
What that script does is take the raw data from a text file (pulled straight from FRED; the loading step is not shown above), turn it into a time-series object in R, automatically select the "best" ARIMA model by minimizing the Bayesian (Schwarz) information criterion (among other things), use the estimated ARIMA model to predict the next five data points, and plot the forecast along with the 100 most recent data points. It turns out that the program selected an ARIMA(1,1,2) with drift. I'm new to the forecast library, so I don't know exactly how much checking auto.arima does for you, but since the selected model differences the data once, you can verify that the differenced series is stationary by using the Augmented Dickey-Fuller test, for example. As promised, we can use the model to forecast the next 5 months (note that December data has not yet been released, so we are forecasting that too).

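| Month | Point forecast | Lower bound (95%) | Upper bound (95%) |
|---|---|---|---|
| December 2013 | 136,943,500 | 136,592,400 | 137,294,500 |
| January 2014 | 137,119,100 | 136,556,200 | 137,682,000 |
| February 2014 | 137,288,700 | 136,494,200 | 138,083,100 |
| March 2014 | 137,452,800 | 136,414,600 | 138,491,000 |
| April 2014 | 137,612,000 | 136,322,700 | 138,901,300 |
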
And here it is in chart form:

Note that I didn't actually do any causal inference here. While I did test for things like stationarity and optimize the ARIMA model fit, it remains the case that the ARIMA model describes correlations that needn't be causal. And that's ok because all we need is for the estimated model to be a consistent predictor of future values, not to describe actual causal relationships. Based on the results, we can be 95% confident that employment will be between 136,556,200 and 137,682,000 people in January.
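
As a quick sanity check on the forecast figures reported above, here is a short R sketch that confirms each interval is centered on its point forecast and backs out the standard error each interval implies. The numbers are copied from the forecast output. The implied standard errors rest on an assumption of mine, namely that the bounds are a normal 95% interval of the form point ± 1.96 × se.

```r
# Values copied from the forecast output reported in the post.
months <- c("Dec 2013", "Jan 2014", "Feb 2014", "Mar 2014", "Apr 2014")
point <- c(136943500, 137119100, 137288700, 137452800, 137612000)
lower <- c(136592400, 136556200, 136494200, 136414600, 136322700)
upper <- c(137294500, 137682000, 138083100, 138491000, 138901300)

# A symmetric interval should be centered on the point forecast.
midpoint <- (lower + upper) / 2
max_gap <- max(abs(midpoint - point))

# Implied standard error, ASSUMING a normal 95% interval.
implied_se <- (upper - lower) / (2 * qnorm(0.975))
names(implied_se) <- months

print(max_gap)
print(round(implied_se))
```

The midpoints match the point forecasts to within 50, which is what you would expect from figures rounded to the nearest hundred, and the implied standard error grows with each additional month of horizon.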

Let me offer another, perhaps more concrete example from a discussion I had recently. It turns out that there is a correlation between relative finger lengths and sexual orientation among men. I'm not going to detail the exact standards of measurement used (you need to make sure everyone is holding their hand in the exact same way to get accurate measurements), but the gist is that gay men are more likely to have a longer index finger than ring finger, while straight men are more likely to have a longer ring finger (yes, I'm aware of one conflicting study, but with all due respect, their data is totally bogus). So on this basis, a remark I made was that if you want to know if a guy is gay, you could compare his finger lengths. There are several valid critiques of such a remark. For one thing, if he hasn't "come out" to you, then a guy's sexual orientation is really none of your business and you should stop speculating about it. Another perfectly valid critique is that while there is a strong correlation between finger length and sexual orientation, this is not a perfect correlation, so some straight men do have "gay" hands, and some gay men have "straight" hands. Yet the blowback I got was the same old "correlation does not imply causation." That's a totally inappropriate time to use that phrase, because we aren't making a causal claim. While it is not true that physically shortening a guy's ring finger will turn him gay (that would be the causal claim), it remains totally true that if he has a shorter ring finger, he is more likely to be gay. We don't need causal inferences to form accurate forecasts.

If you're curious about the biology involved, it is well established that the length of the ring finger in men is heavily influenced by fetal androgen levels, with longer ring fingers being correlated with higher levels of fetal androgens, while the other fingers are controlled more by other factors. So lower levels of fetal androgens will make it more likely that your ring finger turns out to be shorter than your index finger. A plausible causal explanation for the correlation between relative finger lengths and sexuality, then, is that those same hormones also influence your sexual orientation, so that lower levels of fetal androgens could be a contributing factor to homosexuality (it's clearly not "the cause" of homosexuality, as there are many other factors that also influence sexual orientation).

To sum up, correlation does not necessarily imply causality, but we don't need causality for correlations to be useful in making predictions, whether we are predicting next month's economic data or a person's sexual orientation. Obviously those predictions can turn out to be wrong--which is why speculating about someone's sexuality is a bad thing to do--but the wrongness of the prediction has nothing to do with causality. It depends instead on the magnitudes of the correlations in the data and the stability of the data generating process.