Separating Hyperplanes

Why do statisticians use standard deviation?

Matthew Martin 4/02/2014 03:32:00 PM

You probably learned how to calculate a standard deviation[*] in a high school math class. It's a rather strange procedure: you take the square root of the average of the squared differences between individual values and the mean value. A question I often get asked is: why do we care about this funky metric, instead of a more intuitive measure like the mean absolute deviation (the average absolute difference between individual values and the mean)? Well, there are a variety of reasons.

Ultimately, the goal of these calculations is to produce a summary measure of the degree of dispersion of a probability distribution--in layman's terms, we want to calculate how far your data points are from the average.

The first thing to note about absolute deviation is that it does not rank distances at all. Suppose that we have two observations, where one observation is 1 unit away from the mean, while the other is 3 units from the mean. Absolute deviation says that this is the same amount of dispersion as having both observations be 2 units away from the mean. That's fishy--3 units of deviation is further from the mean than 2 units of deviation, yet mean absolute deviation is telling us that these two distributions have the same amount of dispersion. Standard deviation, however, gives greater weight to larger deviations than smaller ones, so it tells us, correctly, that the former distribution is more spread out.
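To make the weighting concrete, here is a minimal Python sketch that takes the distances from the mean in the paragraph above as given (1 and 3 versus 2 and 2) and compares their plain average with their root mean square, which is the calculation inside a standard deviation. The function names are mine, chosen for illustration.

```python
import math

def mean_abs(deviations):
    # Average size of the deviations, ignoring sign.
    return sum(abs(d) for d in deviations) / len(deviations)

def root_mean_square(deviations):
    # Square each deviation, average, then take the square root.
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))

uneven = [1, 3]  # one observation close to the mean, one far away
even = [2, 2]    # both observations the same distance away

print(mean_abs(uneven), mean_abs(even))                   # identical averages
print(root_mean_square(uneven), root_mean_square(even))   # uneven case is larger
```

Squaring turns the 3 into a 9 but the 1 into only a 1, so the far observation dominates the total. That is all "greater weight to larger deviations" means here.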

Let's examine that point more closely. Consider two populations, X and Y, summarized in the table below:

X	Y
1	0.5
2	2.5
3	2.5
4	4.5

Both columns have a mean of 2.5. It also happens that they both have a mean absolute deviation of 1, even though Y is intuitively more spread out than X, covering a range from 0.5 to 4.5, instead of from 1 to 4. Going from X to Y, two observations got closer to the mean, while two others went further away from the mean. The formula for standard deviation implicitly ranks these changes based on how far from the mean they are--an increase in distance of the most extreme values affects standard deviation more than an equivalent decrease in the distance of the less extreme values, so that the standard deviation of Y, 1.41, is larger than the standard deviation of X, 1.12.
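If you want to check the table yourself, here is a short Python sketch using the standard library's statistics module. I am assuming the population formula (dividing by n, which is what `statistics.pstdev` does), since that is what reproduces the 1.12 and 1.41 quoted above. The helper `mean_abs_deviation` is my own, not a library function.

```python
import statistics

X = [1, 2, 3, 4]
Y = [0.5, 2.5, 2.5, 4.5]

def mean_abs_deviation(values):
    # Average absolute distance of each value from the mean.
    m = statistics.fmean(values)
    return sum(abs(v - m) for v in values) / len(values)

for name, data in (("X", X), ("Y", Y)):
    mean = statistics.fmean(data)
    mad = mean_abs_deviation(data)
    sd = statistics.pstdev(data)  # population standard deviation (divides by n)
    print(name, mean, mad, round(sd, 2))
```

Using the sample formula instead (`statistics.stdev`, which divides by n - 1) gives larger numbers for both columns but leaves the ordering unchanged: Y still comes out more spread out than X.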

This gets at a pretty important point: unlike standard deviation, mean absolute deviation does not uniquely characterize the dispersion of a distribution. In statistics, we work with samples and thus don't really know the true population mean. If we want to estimate the population mean, a reasonable approach would be to pick the value that minimizes the dispersion of the observed data about that value. It is always possible to find a value that minimizes the standard deviation of the data about it--this minimizer always exists, is always unique, and is just the sample mean. However, it is frequently impossible to pin down a single value that minimizes the mean absolute deviation, because many different values can produce the same minimum. For column X above, every value between 2 and 3 gives a mean absolute deviation of 1. If you do any work with regression models, then this fact is of obvious importance--ordinary least squares estimation is really just a procedure that picks the coefficients that minimize the sum of squared deviations of the data from the fitted line, the same quantity that sits inside the standard deviation. If we used mean absolute deviation instead, the fitted coefficients would not always be unique, and there would be no simple formula for computing them.
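The uniqueness point is easy to see numerically. The Python sketch below reuses column X from the table and tries several candidate "center" values, reporting the total absolute deviation and the total squared deviation about each one. The candidate grid and function names are my own choices for illustration.

```python
X = [1, 2, 3, 4]

def sum_abs(center, values):
    # Total absolute deviation of the data about a candidate center.
    return sum(abs(v - center) for v in values)

def sum_sq(center, values):
    # Total squared deviation of the data about a candidate center.
    return sum((v - center) ** 2 for v in values)

candidates = [2.0, 2.25, 2.5, 2.75, 3.0]
for c in candidates:
    print(c, sum_abs(c, X), sum_sq(c, X))

# The squared criterion picks out exactly one of the candidates.
best_by_squares = min(candidates, key=lambda c: sum_sq(c, X))
print(best_by_squares)
```

Every candidate between 2 and 3 ties on absolute deviation, so that criterion cannot choose among them. The squared criterion has a single lowest point, at the sample mean of 2.5.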

But aside from these practical considerations, standard deviation actually has strong theoretical appeal as a model parameter. You've probably heard of the normal distribution, which is by far the most widely used and most important probability distribution there is. It turns out that we can uniquely define a normal distribution with just two pieces of information: the mean and the standard deviation, and both appear directly in the formula for the normal density. Mean absolute deviation does not fit in as neatly. For a normal distribution it is a fixed multiple of the standard deviation (sqrt(2/pi), or about 0.8, times it), so writing the normal formula in terms of mean absolute deviation just means converting back to standard deviation first. Hence, while it may seem wacky at first, standard deviation actually represents a simplification versus the less tractable metric of mean absolute deviation.
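As a rough illustration of "two pieces of information," here is the normal density written as a small Python function. The function name `normal_pdf` and its argument names are mine. The formula itself is the standard normal density, and the mean and standard deviation are the only inputs besides the point x being evaluated.

```python
import math

def normal_pdf(x, mean, sd):
    # Normal density at x, fully determined by mean and sd.
    z = (x - mean) / sd  # distance from the mean, in standard deviations
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Same mean, different standard deviations: two different bell curves.
for sd in (1.0, 2.0):
    print(sd, [round(normal_pdf(x, 0.0, sd), 4) for x in (-2, -1, 0, 1, 2)])
```

Changing either argument changes the curve, and no third argument is needed to draw it.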

There's also another, deeper mathematical reason for the appeal of standard deviation. We can describe the mean of a distribution as the "expected value"--that is, the probability-weighted average of the values we might observe. So let's suppose that we have a variable Z with a probability distribution such that the population mean, or "expected value," is zero. It turns out that the variance of Z is equal to the expected value (or mean) of Z*Z (that is, of Z squared). This is known as the "second moment." There are additional moments: the expected value of Z*Z*Z measures the skewness of Z, the expected value of Z*Z*Z*Z measures the kurtosis, and so on. For distributions where the mean is not zero, you need to modify those formulas slightly, subtracting the mean from Z first, to get "central moments" of the distributions, but the broad concept remains intact--variance (and hence, standard deviation) is one of the fundamental metrics, along with all the higher-order moments, that characterize a probability distribution, while mean absolute deviation is not. It's not quite true that you can deduce the whole probability distribution by knowing all the moments (a Pareto distribution with a small enough shape parameter, for example, has no finite mean or variance), but in most well-behaved cases, if all the moments are finite and you know what they are, then you know what the probability distribution is. This knowledge can lead us to some highly general, extremely powerful statistical techniques (for more, see "method of moments").
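Here is a small Python sketch of the moment calculations, using a made-up four-point data set Z whose mean is zero and treating each point as equally likely. The data and function names are illustrative assumptions on my part. It also shows the adjustment needed when the mean is not zero: the variance is the second central moment, which equals the raw second moment minus the square of the mean.

```python
Z = [-2, -1, 0, 3]  # hypothetical data, chosen so the mean is exactly zero

def raw_moment(values, k):
    # Average of the k-th powers: the expected value of Z**k.
    return sum(v ** k for v in values) / len(values)

def central_moment(values, k):
    # Same idea, but measured from the mean instead of from zero.
    m = raw_moment(values, 1)
    return sum((v - m) ** k for v in values) / len(values)

print(raw_moment(Z, 1))      # mean
print(raw_moment(Z, 2))      # second moment, equal to the variance here
print(central_moment(Z, 2))  # variance

# Shift the data so the mean is 10: the raw second moment changes,
# but the central second moment (the variance) does not.
shifted = [v + 10 for v in Z]
print(raw_moment(shifted, 2), central_moment(shifted, 2))
print(raw_moment(shifted, 2) - raw_moment(shifted, 1) ** 2)
```

The conventional skewness and kurtosis figures divide the third and fourth central moments by the standard deviation cubed and to the fourth power, respectively, so that they do not depend on the units of measurement.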

[*]I use standard deviation and variance fairly interchangeably. The former is just the positive square root of the latter, but the issue is fairly superficial since this is a monotone transformation--two distributions having the same standard deviation have the same variance and vice versa, and anything that minimizes the one also minimizes the other.

Anonymous 4/13/2014 09:14:00 PM

I just spent 30 minutes trying to find this explanation on other lesser websites. Finally I understand why standard deviation is more powerful than average deviation. Thank you!

Anonymous 12/02/2014 10:52:00 PM

I couldn't agree more with the comment above. Excellent explanation. Many thanks.

Fredri 1/27/2015 01:52:00 PM

Using regression to justify SD is using an assumption to justify an assumption - just because we want regression to work doesn't mean it should work. The use of SD is nearly completely due to its mathematical properties - that is, it is easier to manipulate than mean deviation. This is not a scientific or philosophical reason to use something. Furthermore, its links with a normal distribution are irrelevant, because they only hold in the situation when the distribution is *perfectly* normal, which in practice, in science, never actually happens. Suggest you read this paper

http://www.leeds.ac.uk/educol/documents/00003759.htm

Anonymous 3/20/2015 08:49:00 AM

Thanks a million for a practical explanation! It may be disputed vehemently by different readers or other practitioners but it helps a lay reader understand why or why not one should spend their time working out SDs at all.

Anonymous 11/06/2015 04:04:00 PM

Good stuff. More good stuff: "Use standard deviation (not mad about MAD)" at http://www.win-vector.com/blog/2014/01/use-standard-deviation-not-mad-about-mad/

Karl Nord 6/10/2016 11:38:00 AM

Thanks good stuff!

Anonymous 7/06/2016 06:24:00 AM

In your numerical example, I fail to see how "Y is intuitively more spread out than X". Yes, on the one hand, 0.5 and 4.5 are further away from the mean (2.5) than are 1 and 4.

But on the other hand, 2.5 and 2.5 are certainly closer (indeed equal) to the mean (2.5) than 2 or 3.

I'd say simply that one can reasonably disagree about whether X or Y is "more spread out". It's certainly far from obvious (as you try to present it).

The example in the preceding paragraph suffers from the same flaw.

Unknown 10/02/2016 05:28:00 PM

You say that both (X,Y) have a MAD of 1, however to my calculation X = 0.75. Please, can you tell me what I did wrong?

Fictionarious 6/17/2017 09:31:00 PM

"It turns out that the variance of Z is equal to the expected value of Z*Z"

You mean it's equal to the expected value of ZZ minus the square of the expected value of Z?

I'm looking for more information on this topic. The mean and SD uniquely characterize a normal distribution, that much is clear. But how does SD "uniquely characterize the dispersion of a distribution" generally?