### No, you can't control for that

12/30/2014 02:59:00 PM

EARLIER THIS month, Ezra Klein wrote a post about the use of statistical "controls" in academic studies. If statistics isn't your thing, here's the background: "controls" are just the list of variables you include in your statistical model other than the variable whose effect you are attempting to estimate. Klein says

"You see it all the time in studies. "We controlled for..." And then the list starts. The longer the better. Income. Age. Race. Religion. Height. Hair color. Sexual preference. Crossfit attendance. Love of parents. Coke or Pepsi. The more things you can control for, the stronger your study is—or, at least, the stronger your study seems. Controls give the feeling of specificity, of precision."
Is this really what people think of controls? Do the study authors put this much faith in the value of controls? I'm genuinely baffled if they do.

Klein went on to mention the downside of including too many controls:

"But sometimes, you can control for too much. Sometimes you end up controlling for the thing you're trying to measure."
Basically, if the thing you care about affects the things you control for, which in turn affect the outcome you care about, then your estimate is missing a portion of the effect you care about. But there's a deeper problem with the sentiment that Klein described. Adding the right controls won't do anything to guarantee that your estimate is causal. And the best study designs are the ones that minimize the number of control variables needed. First, let's examine what control variables are for.

Suppose you want to estimate the effect of variable X on variable Y. A third variable Z also affects Y, and it affects X as well, so that X and Z are correlated. I get complaints whenever I put math on here, so here's a graph of the causal relationships:

In this system, X and Z are both exogenous to Y, and if we regress Y on X and Z the coefficient of X will be an unbiased estimate of the true (linear) causal effect of X on Y. But if we omit Z from this regression, we get a biased estimate, because the regression will falsely attribute some of the variation in Y to X when we haven't told it about Z. I've concocted a simulation you can run in R to see this. It simulates the causal relationship above 1000 times and records the proportion of the time that the true causal effect of X on Y specified in the program does not fall within the 95 percent confidence interval of the regression coefficient, both when Z is and is not included in the regression. With Z included, the true coefficient is outside the confidence interval just 5 percent of the time, exactly as it should be, while it falls outside the interval 100 percent of the time when Z is excluded. This is omitted variables bias. It applies when both these conditions hold:
1. the control variable is correlated with your independent variable of interest, and
2. the control variable has a non-zero effect on the outcome variable.
It's unfortunately pretty common for applied papers to lump any and all variables related to their outcome variable as controls in a regression, but you really shouldn't control for something unless both points above apply. Under classical conditions, including too many variables shouldn't bias the estimate, but in practice you'll get better results by excluding variables whenever it is valid to do so.
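For readers who want to see omitted variables bias in action, here is a minimal Python sketch of the kind of simulation described above. It is not the original R program. The coefficients (0.8 for Z's effect on X, 1.0 for the effects of X and Z on Y), the sample size of 200, and the helper names `ols` and `miss_rate` are all assumptions chosen for illustration, and the confidence interval uses the normal approximation of 1.96 standard errors rather than an exact t critical value.

```python
import random

def ols(y, cols):
    """OLS of y on an intercept plus the given regressor columns.
    Returns (coefficients, standard errors), intercept first."""
    n = len(y)
    X = [[1.0] + [c[i] for c in cols] for i in range(n)]
    k = len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(k)]
    # Invert X'X by Gauss-Jordan elimination on the augmented matrix [X'X | I].
    aug = [xtx[a] + [1.0 if a == b else 0.0 for b in range(k)] for a in range(k)]
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(aug[r][p]))
        aug[p], aug[piv] = aug[piv], aug[p]
        d = aug[p][p]
        aug[p] = [v / d for v in aug[p]]
        for r in range(k):
            if r != p:
                f = aug[r][p]
                aug[r] = [vr - f * vp for vr, vp in zip(aug[r], aug[p])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [yi - sum(b * x for b, x in zip(beta, r)) for r, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - k)  # residual variance
    se = [(s2 * inv[a][a]) ** 0.5 for a in range(k)]
    return beta, se

def miss_rate(include_z, reps=1000, n=200, beta_x=1.0, seed=1):
    """Share of simulations whose approximate 95% interval for the
    coefficient on X fails to cover the true effect beta_x."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        z = [rng.gauss(0, 1) for _ in range(n)]
        x = [0.8 * zi + rng.gauss(0, 1) for zi in z]            # Z affects X (assumed 0.8)
        y = [beta_x * xi + 1.0 * zi + rng.gauss(0, 1)           # X and Z both affect Y
             for xi, zi in zip(x, z)]
        b, se = ols(y, [x, z] if include_z else [x])
        lo, hi = b[1] - 1.96 * se[1], b[1] + 1.96 * se[1]       # b[1] is the X coefficient
        if not (lo <= beta_x <= hi):
            misses += 1
    return misses / reps

with_z = miss_rate(include_z=True)
without_z = miss_rate(include_z=False)
print("miss rate with Z included:", with_z)
print("miss rate with Z omitted: ", without_z)
```

Under these assumed parameters the miss rate with Z included should land near 5 percent and the miss rate with Z omitted near 100 percent, though the exact figures depend on the seed and on how strongly Z is assumed to affect X and Y.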

Frequently, though, I notice commentators and study authors perform a subtle bait-and-switch when they talk about controlling for various factors. For example, suppose we want to estimate the effect of the number of police officers on crime rates. Crime rates vary by neighborhood, and governments typically assign more police officers to higher-crime neighborhoods. This situation sounds superficially like the paragraph above: two regressors, neighborhood and number of police, which are correlated with each other and which both affect crime rates. A naive researcher might regress crime on police, controlling for neighborhood, and claim that the estimate is causal. But it's not! There's been a subtle shift from discussing the effect of neighborhood on crime, given a certain number of police, to the effect of crime on the number of police in those neighborhoods. There's no way to control for the latter. If you actually did this you'd probably conclude that police presence has virtually no effect on crime, but that's wrong.

The causality chart in the example above actually looks like this:

Because governments assign more cops to high-crime neighborhoods, and more police reduce the crime rate in those neighborhoods, your data is not going to show as strong a correlation between police and crime as it would otherwise. OLS regression has no way to pick apart the effect of police on crime from the effect of crime on police numbers (via policy makers who watch crime rates). The variation needed to separate the two simply does not exist in the data, so the causal effect cannot be estimated. This is endogenous variables bias, and there's no number of control variables you can add that will help.

While you can certainly have both endogeneity and omitted variables bias, fixing the second does nothing for the first: in my simulation with endogeneity added, the estimate was wrong 100 percent of the time with Z in the regression, and 100 percent of the time with Z omitted.
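The same point can be illustrated with a small Python sketch of the police example. This is an assumed toy model, not the author's simulation: the true effect of police on crime is set to -0.5, police assignment responds to crime with a coefficient of 0.8, and the neighborhood effect Z enters the crime equation with a coefficient of 1.0. Instead of confidence-interval coverage it reports the average estimate, and it "controls" for Z by partialling Z out of both variables (the Frisch-Waugh-Lovell approach), which gives the same coefficient as including Z in the regression. The names `slope`, `residualize`, and `mean_estimate` are hypothetical helpers.

```python
import random

TRUE_EFFECT = -0.5  # assumed causal effect of police on crime
RESPONSE = 0.8      # assumed response of police assignment to crime

def slope(y, x):
    """Slope from a simple regression of y on x (with intercept)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def residualize(y, x):
    """Residuals of y after regressing it on x (with intercept)."""
    n = len(y)
    b = slope(y, x)
    mx, my = sum(x) / n, sum(y) / n
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]

def mean_estimate(control_for_z, reps=1000, n=200, seed=2):
    """Average estimated effect of police on crime across simulations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        z = [rng.gauss(0, 1) for _ in range(n)]  # neighborhood effect
        u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved crime shock
        v = [rng.gauss(0, 1) for _ in range(n)]  # unobserved staffing shock
        # Police and crime are determined jointly:
        #   crime  = z + TRUE_EFFECT * police + u
        #   police = RESPONSE * crime + v
        # Solving the two equations for police gives the line below.
        denom = 1.0 - RESPONSE * TRUE_EFFECT
        police = [(RESPONSE * (zi + ui) + vi) / denom
                  for zi, ui, vi in zip(z, u, v)]
        crime = [zi + TRUE_EFFECT * pi + ui
                 for zi, pi, ui in zip(z, police, u)]
        if control_for_z:
            # Partial Z out of both variables, then regress residual on residual.
            total += slope(residualize(crime, z), residualize(police, z))
        else:
            total += slope(crime, police)
    return total / reps

with_control = mean_estimate(control_for_z=True)
without_control = mean_estimate(control_for_z=False)
print("true effect:                 ", TRUE_EFFECT)
print("average estimate, Z included:", round(with_control, 3))
print("average estimate, Z omitted: ", round(without_control, 3))
```

In this toy model, controlling for neighborhood moves the average estimate closer to the truth but leaves it with the wrong sign, because police levels are still correlated with the unobserved crime shock that policy makers respond to.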

What we need to get a true causal estimate of the effect of police on crime is to identify a source of exogenous variation in police levels. This is different from adding control variables. One strategy would be to perform an experiment by randomly assigning police officers to neighborhoods. That would eliminate the endogeneity because randomized police presence would mean that crime rates have no effect on police presence, so it is safe to attribute variation in crime rates to the variation in police levels. Incidentally, we don't actually have to "control" for neighborhood effects now, because with perfect randomization neighborhoods are no longer correlated with levels of police, so there's no omitted variables bias either. This is what I mean when I say that the best study designs are the ones that minimize the number of control variables: the closer to random our treatment variable is, the less of both endogeneity and omitted variables bias we are likely to face.
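A rough Python sketch of the randomized version of the police example, again under assumed parameters: the true effect of police on crime is -0.5, the neighborhood effect enters crime with a coefficient of 1.0, and police levels are drawn at random, independently of both crime and neighborhood. The regression deliberately includes no controls. The names `slope` and `mean_estimate_randomized` are hypothetical.

```python
import random

TRUE_EFFECT = -0.5  # assumed causal effect of police on crime

def slope(y, x):
    """Slope from a simple regression of y on x (with intercept)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def mean_estimate_randomized(reps=1000, n=200, seed=3):
    """Average estimated effect of police on crime when police levels are
    randomly assigned and no control variables are used."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        z = [rng.gauss(0, 1) for _ in range(n)]       # neighborhood effect
        police = [rng.gauss(0, 1) for _ in range(n)]  # random assignment
        crime = [zi + TRUE_EFFECT * pi + rng.gauss(0, 1)
                 for zi, pi in zip(z, police)]
        total += slope(crime, police)                 # Z is left out on purpose
    return total / reps

estimate = mean_estimate_randomized()
print("true effect:                     ", TRUE_EFFECT)
print("average estimate, no controls:   ", round(estimate, 3))
```

Because the assumed police levels are independent of everything else that drives crime, the average estimate should sit close to the true effect even though neighborhood is omitted. Including it would only tighten the estimate, not remove any bias.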

For the most part, having a very long list of control variables is actually evidence of the weakness of the underlying study design. Controlling for these variables is better than not when there is risk of omitted variables bias, but a design that has such risks is considerably weaker than one that does not. Bottom line: be suspicious whenever a paper says "controlling for ____". There's a good chance you can't actually control for that.

David Andolfatto 12/31/2014 12:49:00 PM
Nice post, Matthew.
Dustin 1/29/2015 11:31:00 AM
Very nicely done. You just blew my mind (and made me want to go back to all the papers where I recommend researchers "control" for things). Ah!