Why do statisticians use standard deviation?
Matthew Martin 4/02/2014 03:32:00 PM
Ultimately, the goal of these calculations is to produce a summary measure of the degree of dispersion of a probability distribution--in layman's terms, we want to measure how far the data points typically are from the average.
The first thing to note about absolute deviation is that it weights every unit of distance equally, no matter how far from the mean it lies. Suppose that we have two observations, where one observation is 1 unit away from the mean, while the other is 3 units from the mean. Mean absolute deviation says that this is the same amount of dispersion as having both observations be 2 units away from the mean: the average distance is 2 in both cases. That's fishy--a deviation of 3 units is farther from the mean than a deviation of 2 units, yet mean absolute deviation is telling us that these two distributions have the same amount of dispersion. Standard deviation, however, squares each deviation before averaging, so it gives greater weight to larger deviations than smaller ones. Here it comes out to the square root of (1 + 9)/2 = 5, or about 2.24, for the first pair versus exactly 2 for the second, so the standard deviation tells us, correctly, that the former distribution is more spread out.
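To make that comparison concrete, here is a minimal Python sketch. The two data sets are made up for illustration (they are not from any real population), and both are centered on a known mean of zero so that the deviations are exactly the 1, 3 and 2, 2 pattern described above. The helper names are my own.

```python
import math

def mean_abs_dev(xs, center=0.0):
    """Average absolute distance of the values from `center`."""
    return sum(abs(x - center) for x in xs) / len(xs)

def std_dev(xs, center=0.0):
    """Root-mean-square distance of the values from `center` (population form)."""
    return math.sqrt(sum((x - center) ** 2 for x in xs) / len(xs))

# Hypothetical data: both sets have mean 0.
uneven = [-3, -1, 1, 3]   # deviations of 1 and 3 units
even = [-2, -2, 2, 2]     # every deviation is 2 units

mad_uneven, mad_even = mean_abs_dev(uneven), mean_abs_dev(even)
sd_uneven, sd_even = std_dev(uneven), std_dev(even)

print("mean absolute deviation:", mad_uneven, mad_even)
print("standard deviation:     ", sd_uneven, sd_even)
```

The two mean absolute deviations come out identical, while the standard deviation of the first set is larger because the 3-unit deviations are squared before averaging.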
Let's examine that point more closely. Consider two populations, X and Y, summarized in the table below:
This gets at a pretty important point: unlike standard deviation, mean absolute deviation does not uniquely characterize the dispersion of a distribution. In statistics, we work with samples and thus don't really know the true population mean. If we want to estimate the population mean, a reasonable approach would be to pick the value that minimizes the dispersion of the observed data about that value. If dispersion is measured by standard deviation (that is, by the root-mean-square distance from the chosen value), this minimizer always exists and is always unique: it is the sample mean. If dispersion is measured by mean absolute deviation, the minimizer is the median, and it is often not unique. With an even number of observations, for example, every value between the two middle observations produces exactly the same mean absolute deviation. As observed above, the mean absolute deviation does not uniquely characterize the distribution. If you do any work with regression models, then this fact is of obvious importance--ordinary least squares estimation is really just a procedure that picks the coefficients minimizing the standard deviation of the residuals (equivalently, the sum of squared residuals). If we minimized mean absolute deviation instead, the coefficients would have no closed-form solution and would not always be unique.
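The uniqueness point is easy to check by brute force. The sketch below uses a small made-up sample and a handful of candidate "center" values (both are assumptions chosen for illustration) and compares the total squared distance with the total absolute distance at each candidate.

```python
def sum_abs(xs, c):
    """Total absolute distance of the values from candidate center c."""
    return sum(abs(x - c) for x in xs)

def sum_sq(xs, c):
    """Total squared distance of the values from candidate center c."""
    return sum((x - c) ** 2 for x in xs)

data = [1, 2, 4, 7]            # hypothetical sample with an even number of points
mean = sum(data) / len(data)   # 3.5

# Candidate centers, all lying between the two middle observations (2 and 4).
candidates = [2, 2.5, 3, 3.5, 4]
abs_losses = [sum_abs(data, c) for c in candidates]
sq_losses = [sum_sq(data, c) for c in candidates]

for c, a, s in zip(candidates, abs_losses, sq_losses):
    print(f"center={c:<4} absolute={a:<5} squared={s}")
```

Every candidate between 2 and 4 gives the same total absolute distance, so the absolute criterion cannot pick one of them. The squared criterion has a single lowest value, and it sits at the sample mean. Dividing by the sample size or taking a square root does not change where the minimum is.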
But aside from these practical considerations, standard deviation actually has strong theoretical appeal as a model parameter. You've probably heard of the normal distribution, which is by far the most widely used and most important probability distribution there is. It turns out that we can uniquely define a normal distribution with just two pieces of information: the mean and the standard deviation, and these are exactly the two parameters that appear in the formula for its density. Mean absolute deviation gets no such role. For a normal distribution it is simply a fixed multiple of the standard deviation (the standard deviation times the square root of 2/pi, or roughly 0.8 times the standard deviation), so it carries no additional information, and rewriting the formula in terms of it only adds clutter. Hence, while it may seem wacky at first, standard deviation actually represents a simplification versus the less tractable metric of mean absolute deviation.
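For readers who want to see the two-parameter claim in action, here is a rough numerical check. It writes the normal density using only a mean `mu` and a standard deviation `sigma` (the particular values 1.0 and 2.0 are arbitrary choices of mine), then integrates numerically with a simple midpoint rule. The same sketch also computes the mean absolute deviation, which for a normal distribution works out to sigma times the square root of 2/pi.

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density, written with just the mean and the standard deviation."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))

mu, sigma = 1.0, 2.0                           # arbitrary illustrative parameters
lo, hi = mu - 12 * sigma, mu + 12 * sigma      # wide enough that the tails are negligible

total = integrate(lambda x: normal_pdf(x, mu, sigma), lo, hi)
variance = integrate(lambda x: (x - mu) ** 2 * normal_pdf(x, mu, sigma), lo, hi)
mad = integrate(lambda x: abs(x - mu) * normal_pdf(x, mu, sigma), lo, hi)

print("total probability:      ", total)
print("standard deviation:     ", math.sqrt(variance))
print("mean absolute deviation:", mad, "vs sigma*sqrt(2/pi) =", sigma * math.sqrt(2 / math.pi))
```

This is only a numerical approximation, so the printed values agree with the exact ones up to a small integration error.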
There's also another, deeper mathematical reason for the appeal of standard deviation. We can describe the mean of a distribution as the "expected value"--that is, the probability-weighted average of all the values we might observe. So let's suppose that we have a variable Z with a probability distribution such that the population mean, or "expected value," is zero. It turns out that the variance of Z is equal to the expected value (or mean) of Z*Z (that is, of Z squared). This is known as the "second moment." There are additional moments: the expected value of Z*Z*Z is the third moment, which (once divided by the cube of the standard deviation) gives the skewness of Z; the expected value of Z*Z*Z*Z is the fourth moment, which (divided by the fourth power of the standard deviation) gives the kurtosis; and so on. For distributions where the mean is not zero, you need to modify those formulas slightly to get "central moments" of the distributions, but the broad concept remains intact--variance (and hence, standard deviation) is one of the fundamental metrics, along with all the higher-order moments, that characterize a probability distribution, while mean absolute deviation is not. It's not quite true that you can deduce the whole probability distribution by knowing all the moments (the Cauchy distribution, for example, has no finite moments at all, and a Pareto distribution has only its lower-order ones), but if all the moments are finite and do not grow too quickly, then knowing them tells you what the probability distribution is. This knowledge can lead us to some highly general, extremely powerful statistical techniques (for more, see "method of moments").
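As a small illustration of the moments idea, the sketch below treats a made-up list of equally likely values with mean zero as the whole population (an assumption for the example, not real data) and computes its first four moments directly as averages of powers. The function name `raw_moment` is my own.

```python
import statistics

def raw_moment(xs, k):
    """k-th moment of equally likely values: the average of x**k."""
    return sum(x ** k for x in xs) / len(xs)

z = [-2, -1, 0, 1, 2]   # hypothetical population with mean zero

m1 = raw_moment(z, 1)   # mean
m2 = raw_moment(z, 2)   # second moment; equals the variance because the mean is zero
m3 = raw_moment(z, 3)   # third moment
m4 = raw_moment(z, 4)   # fourth moment

sd = m2 ** 0.5
skewness = m3 / sd ** 3   # third moment in units of the standard deviation
kurtosis = m4 / sd ** 4   # fourth moment in units of the standard deviation

print("mean:", m1, "variance:", m2, "library variance:", statistics.pvariance(z))
print("skewness:", skewness, "kurtosis:", kurtosis)
```

Because the mean is zero, the second moment matches the population variance reported by the standard library. For data with a nonzero mean, the same averages would need to be taken of (x - mean)**k instead.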
[*] I use standard deviation and variance fairly interchangeably. The former is just the positive square root of the latter, and the distinction is fairly superficial since this is a monotone transformation--two distributions having the same standard deviation have the same variance and vice versa, and anything that minimizes the one also minimizes the other.