Separating Hyperplanes

Why Do We Measure Populations by Polity?

Matthew Martin 4/10/2013 09:20:00 PM

Not a very interesting post today, but since my last post on the study by Kindig and Cheng, I've noticed a lot of critiques, mostly from non-scientists, saying something like this:

The statistic of % of counties is meaningless. Some have millions and others have 10,000, including a lot of those red counties. We can only conclude that the percentage by population is much less dramatic.

That was a comment by Observer21 over at Sarah Kliff's post about this paper. Since one of the main reasons I started this blog was to bridge the gap between scientists and non-scientists, I thought this would make a good teachable moment.

First of all, the commenter has a point. Counties are not an optimal way to divide the data up into units. Yet social scientists do it all the time. And when they aren't doing it by county, they use states and countries instead, which are also arbitrary political boundaries. But the part about the "percentage by population" doesn't quite make sense. Sure, we'd love to know what percentage of women experienced a decrease in mortality rates over the 20-year period, but unfortunately this is not an observable characteristic, since we can't actually observe each individual's probability of death each year--our data only tells us dead or alive, not the ex-ante probability of death. So the only way to look at the distribution of the changes in female mortality across women is to divide the population up into subgroups and compare mortality rates across those subgroups--in this case, the authors decided to divide the population into subgroups based on geographic location.

The question remains, though: why based on arbitrary county lines? Why not draw geographic districts that each have roughly the same population? The reason social scientists stick to pre-existing political boundaries, like county or state lines, to define their datasets is that it is far, far easier than the alternative. Drawing our own districts, thousands of them, and individually matching each observation to a newly drawn district is labor-intensive and time-consuming, and therefore also very expensive. Moreover, it is generally not possible, since publicly available datasets like the census and BRFSS don't literally give us the GPS coordinates of each individual--instead we have to rely on the geographic identifiers they provide, which are generally no more specific than county-level, and often not even that. Why do the census and BRFSS surveys not define more meaningful geographic districts? I think the answer here is twofold: first, ultimately any district map is going to be arbitrary anyway, since populations shift over time, while the districts have to remain the same year-to-year to have any statistical usefulness; second, for many surveys and datasets the county governments are an instrumental part of the data collection process. Ultimately the county-level government apparatus does the legwork in state-level programs like Medicaid, meaning that these records are available on a county-by-county basis. Given that we want various datasets to be compatible with each other as much as possible, this provides ample reason to identify individuals in all national surveys on a county basis.

So, county-level analysis for population studies makes sense. But it is important to be vigilant against committing the fallacy of composition--it is possible, for example, for mortality rates to rise in 100% of counties but fall in the national aggregate, if people move from high-mortality counties to low-mortality ones. So without some more detailed analysis of population movements, all we can say from the county map on Sarah Kliff's blog is that there were significant regional disparities in female mortality trends.
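If that last claim sounds impossible, a toy calculation may help. The sketch below uses two made-up counties, "A" and "B", with populations and mortality rates that I invented purely for illustration (none of these numbers come from the paper). The national rate is just the population-weighted average of the county rates, so when the weights shift toward the low-mortality county, the aggregate can fall even though each county's own rate went up.

```python
# Hypothetical two-county example. All populations and rates are invented
# for illustration and are not taken from any real dataset.
# Each entry maps county -> (female population, deaths per 1,000 women).
before = {"A": (900_000, 10.0), "B": (100_000, 2.0)}
after = {"A": (100_000, 11.0), "B": (900_000, 3.0)}


def national_rate(counties):
    """Population-weighted mortality rate (deaths per 1,000) across counties."""
    total_pop = sum(pop for pop, _ in counties.values())
    total_deaths = sum(pop * rate / 1000 for pop, rate in counties.values())
    return 1000 * total_deaths / total_pop


rate_before = national_rate(before)
rate_after = national_rate(after)

# Did the rate rise in every single county?
every_county_rose = all(after[c][1] > before[c][1] for c in before)

print("Rate rose in every county:", every_county_rose)
print("National rate before:", round(rate_before, 2))
print("National rate after: ", round(rate_after, 2))
```

Here the rate rises in both counties (10 to 11 in A, 2 to 3 in B), yet the national rate drops, because most of the population now lives in the lower-mortality county. Real migration is obviously messier than this, which is exactly why the county map alone can't tell us what happened to the national aggregate.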