A few thoughts on statistical significance

5/06/2013 06:31:00 PM
The new study results from the Oregon Medicaid experiment have everyone weighing in on how to interpret various statistically significant and insignificant results.

One problem that has been pointed out is that some people are improperly comparing statistical significance across results. This isn't just technically invalid, but actually downright stupid when you think about it. For example, it is possible for two studies to find exactly the same result, but for one to be statistically significant while the other is not (if, for example, one had more observations than the other). Even if the two results are different, if the point estimate from one falls within the confidence interval of the other, then we really can't say that the two results differed at all.
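To make both points concrete, here is a minimal sketch in Python (not from the post itself; the two-arm setup, effect size, and sample sizes are all illustrative assumptions). Two simulated "studies" estimate the same true effect, but only the larger one is likely to clear the significance threshold, and the valid comparison is a test on the difference between the two estimates:

```python
# Hypothetical illustration: same true effect, different sample sizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def study(n, effect=0.2, sd=1.0):
    """Simulate a two-arm trial; return (estimate, standard error, p-value)."""
    control = rng.normal(0.0, sd, n)
    treated = rng.normal(effect, sd, n)
    est = treated.mean() - control.mean()
    se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
    p = 2 * stats.norm.sf(abs(est / se))
    return est, se, p

est_big, se_big, p_big = study(n=2000)      # large study: likely significant
est_small, se_small, p_small = study(n=50)  # small study: likely not

print(f"large study:  estimate={est_big:.3f}, p={p_big:.3f}")
print(f"small study:  estimate={est_small:.3f}, p={p_small:.3f}")

# The valid comparison tests whether the two estimates differ from each other:
diff = est_big - est_small
se_diff = np.sqrt(se_big**2 + se_small**2)
p_diff = 2 * stats.norm.sf(abs(diff / se_diff))
print(f"difference between studies: {diff:.3f}, p={p_diff:.3f}")
```

The last test is the point: significance in one study and insignificance in the other tells us nothing about whether the two estimates actually differ.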

Other problems have to do with the interpretation of P-values. Some people yelled at Austin Frakt for saying that a P-value of 0.07 was "almost significant." Now, we would have a problem if scientists were adjusting significance thresholds in response to the results of their studies, but as that is clearly not the case here, Frakt's interpretation is completely valid in a frequentist world--the P-value is a continuous measure of evidence.
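To see what "continuous measure" buys you, here is a small sketch (my illustration, not Frakt's): the test statistics that produce P-values on either side of the conventional 0.05 threshold are barely distinguishable.

```python
# How much evidence separates "significant" from "almost significant"?
from scipy import stats

for p in (0.04, 0.05, 0.07, 0.10):
    z = stats.norm.isf(p / 2)  # two-sided test statistic that yields this p-value
    print(f"p = {p:.2f}  <->  |z| = {z:.2f}")

# p = 0.05 corresponds to |z| of about 1.96 and p = 0.07 to about 1.81 --
# a tiny difference in the underlying evidence, despite falling on opposite
# sides of the conventional cutoff.
```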

On a more general note, no one is quite sure what a P-value is. Pedagogically, it is the conditional probability of observing data at least as discrepant as what we actually saw (technically, a test statistic at least as extreme as the one we computed) given that the null hypothesis is true. Stating it that way, I think, reveals how absurd it is to suggest that the P-value is the probability that the null is true--that is precisely the thing we conditioned on, not the thing we computed a probability for. If that P-value concept sounds roundabout, that's because it is. Scientists would love to know the probability that the null is true, but we cannot observe unconditional probabilities--the data we observe are all conditional on whatever the truth actually is. So instead we essentially hunt for discrepancies between the observed data and the data predicted by a theoretical model in which the null is true. If the discrepancy is sufficiently unlikely under that model, we "reject the null."
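That conditional logic is easy to see in a simulation. Here is a rough sketch (all numbers are made up for illustration): simulate data from a world in which the null is true, and ask how often the discrepancy is at least as extreme as the one we actually observed.

```python
# P-value as P(discrepancy this large | null is true), by brute force.
import numpy as np

rng = np.random.default_rng(1)

observed = rng.normal(0.3, 1.0, 100)   # hypothetical sample (true mean is 0.3)
observed_stat = abs(observed.mean())   # discrepancy from the null mean of 0

# Data predicted by the theoretical model in which the null (mean = 0) is true:
null_stats = np.array([
    abs(rng.normal(0.0, 1.0, 100).mean())
    for _ in range(10_000)
])

p_value = (null_stats >= observed_stat).mean()
print(f"simulated P-value: {p_value:.4f}")
# This is the probability of the data given the null -- not the probability
# that the null is true.
```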

That brings me to the next thought, which is that failing to reject the null should not, in general, mean accepting the null. Too often I see people talk about accepting versus rejecting the null--you are never supposed to "accept" the null hypothesis. Failing to find evidence against the null is not the same thing as finding evidence for it. In policy, it is often necessary to act as if one hypothesis or the other is true--for example, a doctor must assume either that a patient needs treatment or that he doesn't, and there is no neutral position. As a risk-management technique, we deal with this problem by making the null whichever state of the world would be more damaging to get wrong. On a related note, rejecting the null also isn't necessarily the same as accepting the alternative. Or rather, I mean it's not the same as accepting your alternative. Case in point: rejecting the null hypothesis that the results of your randomized experiment were random should not lead you to conclude that magic is real. At best, you can conclude that the results were non-random. That is, we can only make statistical distinctions between hypotheses that are not only mutually exclusive but also complementary--the alternative can be nothing more than the negation of the null.
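Why failing to reject isn't the same as accepting is easiest to see with an underpowered test. A rough sketch (the effect size and sample sizes are arbitrary choices of mine): the null of zero effect is false by construction, yet a small study usually fails to reject it.

```python
# Absence of evidence is not evidence of absence: a false null often survives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_effect = 0.2  # the null of zero effect is genuinely false here

def rejects(n, alpha=0.05):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    return stats.ttest_ind(treated, control).pvalue < alpha

for n in (20, 200, 2000):
    rate = np.mean([rejects(n) for _ in range(1000)])
    print(f"n = {n:5d}: rejected the (false) null in {rate:.0%} of simulations")

# With n = 20 per arm, the false null survives most of the time. Failing to
# reject it says more about the study's power than about the truth.
```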

A lot of this confusion could be cleared up if, instead of reporting P-values, we reported confidence intervals. But I'm sure we'd have plenty of misinterpretations there too. The correct interpretation of a 95% confidence interval is that we are 95% "confident" that the true parameter is within the interval. But then, that's where frequentist assumptions come back to bite us--we have to say "we are 95% confident" rather than "the probability is 95%" because the latter requires knowledge about priors that frequentists profess not to have. The Bayesians have us beat on that point. But the point I really want to make is this: inevitably someone asserts that the probability (or "confidence") of observing an outcome outside the 95% interval is 5%. NO! The confidence interval is a range for the parameter, not for individual observations. In fact, you will typically observe lots of data falling outside the computed confidence interval. If forecasting is what you want, then you need to compute a prediction interval, not a confidence interval.
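Here is a quick sketch of that last distinction (normal data and the sample size are illustrative assumptions of mine): the 95% confidence interval covers the mean, so plenty of individual observations fall outside it, while the prediction interval is the one that covers a new observation.

```python
# Confidence interval for the mean vs. prediction interval for a new observation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(10.0, 2.0, 500)
n, mean, sd = len(x), x.mean(), x.std(ddof=1)
t = stats.t.ppf(0.975, df=n - 1)

# 95% confidence interval for the mean:
ci = (mean - t * sd / np.sqrt(n), mean + t * sd / np.sqrt(n))
# 95% prediction interval for a single new observation:
pi = (mean - t * sd * np.sqrt(1 + 1 / n), mean + t * sd * np.sqrt(1 + 1 / n))

outside_ci = np.mean((x < ci[0]) | (x > ci[1]))
print(f"95% CI for the mean:      ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% prediction interval:  ({pi[0]:.2f}, {pi[1]:.2f})")
print(f"share of observations outside the CI: {outside_ci:.0%}")
```

With a sample this size, the confidence interval is narrow and the vast majority of the data falls outside it, which is exactly why it is the wrong tool for forecasting individual outcomes.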