Separating Hyperplanes
You've probably heard of Michael LaCour, the political science grad student who will probably never graduate now that everyone knows his provocative and surprisingly conclusive study was, in fact, based on data he fabricated. Political science isn't my field, and I don't have as good an eye for context as Andrew Gelman, so when the study originally came out I didn't grasp how significant the results were--or rather would have been, had they been real. The paper described an experiment in which gay and straight canvassers went around trying to persuade people to support gay marriage, and found that gay canvassers persuaded far more people than straight ones, suggesting that personal acquaintance has a big effect on people's political views. I was vaguely aware of previous (real) research showing that in-person canvassing was the most effective form of political persuasion, so I didn't think much of the new results. But in fact LaCour's results were way out of place, showing a massively larger effect than anything previously found. As Gelman said when the paper first came out:
"A difference of 0.8 on a five-point scale . . . wow! You rarely see this sort of thing. Just do the math. On a 1-5 scale, the maximum theoretically possible change would be 4. But, considering that lots of people are already at “4” or “5” on the scale, it’s hard to imagine an average change of more than 2. And that would be massive. So we’re talking about a causal effect that’s a full 40% of what is pretty much the maximum change imaginable. Wow, indeed. And, judging by the small standard errors (again, see the graphs above), these effects are real, not obtained by capitalizing on chance or the statistical significance filter or anything like that."
People more clued into political science research than I was spotted the implications right away, and this, along with the timeliness of the subject matter of gay marriage, catapulted LaCour and his paper into fame.

But it turns out the data was fake. I recommend reading the excellent documentation from the researchers who expertly caught the fraud. It really is an example not just of talent, but of professionalism. Accusations of fraudulent data are thrown around with surprising frequency--mostly at results the accusers personally disagree with--but it's rare to see real investigative work uncover it while offering every reasonable benefit of the doubt in the process.

I suspect that there's a media bias towards fraudulent research. LaCour's paper became famous precisely because the underlying data was fake, producing a result that was untrue but also wildly surprising and newsworthy. Because the media focuses on interesting results, it will be disproportionately likely to report fake results, which are inherently interesting because they don't match the real world. Given this, we need media outlets to cover new research more the way Gelman did--he did not suspect fraud, but he did emphasize how out-of-place the results were, and I think that goes a long way toward dampening the spread of misinformation when retractions do happen. And of course, not all retractions are fraud--honest mistakes happen, and it doesn't take any mistake or fraud to get an occasional oddball dataset that yields a false positive.

Still, I worry that the only reason we eventually learned of this fraud is that LaCour knew just enough statistics to get himself into trouble (what if he had understood "heaping" effects better?) and that the two researchers, Broockman and Kalla, wanted to perform a similar extension of the paper. What if no one had happened to try to replicate the methods? How many fraudulent papers and datasets are still out there?

Adam Marcus and Ivan Oransky tell the story of a much bigger fish than LaCour, whose fishy data, published in papers that went undetected for decades, produced a record-breaking 183 retractions, comprising about 7 percent of all retractions over the entire period from 1980 to 2011. Researchers eventually caught the fraudster, Yoshitaka Fujii, by using statistical methods to test whether the datasets Fujii produced matched what would be expected from sampling real-world data:
'Using these techniques, Carlisle concluded in a paper he published in Anaesthesia in 2012 that the odds of some of Fujii’s findings being experimentally derived were on the order of 10^-33, a hideously small number. As Carlisle dryly explained, there were “unnatural patterns” that “would support the conclusion that these data depart from those that would be expected from random sampling to a sufficient degree that they should not contribute to the evidence base.”'
Marcus and Oransky strongly imply that we should apply John Carlisle's statistical method automatically to all papers submitted to journals, to weed out fraudulent papers. They express frustration at journal editors and others who resist this proposal.
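Just to give a flavor of what this kind of screening involves, here is a minimal sketch in C#--emphatically not Carlisle's actual procedure, and with numbers I made up for illustration--that asks how often two groups drawn at random from the same population would produce means as close together as a reported pair. Probabilities like this, compounded across many baseline variables and many trials, are how you end up with figures like the 10^-33 above.

    using System;

    class BaselineSimilarityCheck
    {
        // Made-up illustration numbers, not from Fujii's papers or Carlisle's analysis.
        const int GroupSize = 30;
        const double PopulationSd = 12.0;
        const double ReportedMeanDifference = 0.1; // a suspiciously tiny gap between group means
        const int Simulations = 100000;

        static void Main()
        {
            var rng = new Random(42);
            int atLeastAsClose = 0;

            for (int i = 0; i < Simulations; i++)
            {
                double mean1 = SampleMean(rng, GroupSize, PopulationSd);
                double mean2 = SampleMean(rng, GroupSize, PopulationSd);
                if (Math.Abs(mean1 - mean2) <= ReportedMeanDifference)
                    atLeastAsClose++;
            }

            // A small probability for one variable in one trial isn't damning by itself;
            // dozens of such coincidences across dozens of trials are.
            Console.WriteLine("P(|difference in means| <= {0}) is about {1:F4}",
                ReportedMeanDifference, (double)atLeastAsClose / Simulations);
        }

        // Mean of n draws from Normal(0, sd), via the Box-Muller transform.
        static double SampleMean(Random rng, int n, double sd)
        {
            double sum = 0;
            for (int j = 0; j < n; j++)
            {
                double u1 = 1.0 - rng.NextDouble();
                double u2 = rng.NextDouble();
                double z = Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
                sum += z * sd;
            }
            return sum / n;
        }
    }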

I think regular use of statistical forensics is unlikely to get us anywhere in the long run. Like LaCour, Fujii was caught because he knew just enough statistics to get himself into trouble. If we start screening all datasets with these methods, fraudsters will simply learn how to simulate data that can pass the test. When you aren't constrained to using real-world data, it is possible to pass any statistical test. That leaves us with a Bayesian problem: the vast majority of researchers don't commit fraud, and the ones who do will (if they adapt their methods) be the least likely to fail the test. So the proposal actually leaves us with a ton of false positives and plenty of false negatives as well. Statistical methods are certainly a handy thing to keep in our arsenal, but by themselves I don't have great faith that they will detect fraud with the degree of confidence we need. They aren't fit for routine, universal screening--at best the screen would keep a few legitimate papers unpublished without finding much actual fraud, and at worst honest researchers would find their careers ruined by false positives.
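To put rough numbers on that base-rate intuition, here's a back-of-the-envelope Bayes calculation. All three inputs are assumptions I invented purely for illustration, but the qualitative point survives most reasonable choices: when fraud is rare and fraudsters adapt, most flagged papers are honest ones.

    using System;

    class ScreeningBaseRate
    {
        static void Main()
        {
            double fraudRate = 0.01;         // assume 1 percent of submissions are fraudulent
            double sensitivity = 0.5;        // assume adapted fraudsters pass half the time
            double falsePositiveRate = 0.02; // assume 2 percent of honest papers get flagged

            double shareFlagged = fraudRate * sensitivity + (1 - fraudRate) * falsePositiveRate;
            double fraudGivenFlag = fraudRate * sensitivity / shareFlagged;

            Console.WriteLine("Share of papers flagged: {0:P1}", shareFlagged);
            Console.WriteLine("Probability a flagged paper is actually fraudulent: {0:P1}", fraudGivenFlag);
        }
    }

With these made-up inputs, about four out of five flagged papers would be false positives.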

Does anyone have other ideas for routine, universal fraud screening? I have one: journals should validate the data collection process. The most devastating evidence against LaCour, in my view, is the fact that the survey firm he claimed to use had never heard of him, did not have the ability to do the type of survey in question, and never employed the person LaCour claimed was his contact there. A simple phone call from the journal's editorial office could have spotted the fraud right away. There were other clues related to the data collection: LaCour claimed to have paid the survey firm with a grant that--as best I can tell--he was never awarded. Journals should make phone calls to granting agencies, survey firms, and other third parties who would have knowledge of the data collection process for every single paper they accept. Submissions to journals should include not just a section describing what's in the data, but actual documentation of how the authors got their data, including independent references and their phone numbers.
5/22/2015 12:40:00 PM

A project I'm working on has me thinking about security and identity management on the internet. Between all the interactive stuff and all the paywalls, just about anywhere you go on the internet you get this:

Login
User name:
Password:
So here's my question: why do we need to enter both?

The obvious answer is that the password is what lets you into the website, but the server needs the user name to look up your password in the database. Since the username is unique, this works--the server can look up only the password for your username, compare it to what you entered, and let you in if they match. But two users are allowed to use the same password, so looking up the password directly could produce multiple matches. But what if passwords were unique?

Of course, you can't have a password-only login that tells users whether a password is in use when they pick one, because that would be giving them complete access to someone else's account. But because very large numbers exist in the universe, it is nevertheless possible to issue passwords that are both totally random and totally unique. Programmers do it all the time. For example, in C#, the statement Guid.NewGuid(); generates a "globally unique identifier," by which they mean a really big random number reinterpreted as a string of letters and numbers.1 Sure, you'd have to store all these passwords in a password manager--you can't memorize them--but really, who can memorize all of their passwords? If you aren't already using a password manager, you might as well consider all your accounts already compromised. And with a manager, your life is the same whether you got to pick your own passwords or not, and whether those passwords are 8 characters or 80 characters long.
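Here's a minimal sketch of what issuing such a credential could look like. The GUID line is the one described above; the RandomNumberGenerator alternative is my own addition, since Guid.NewGuid() makes no promises about the cryptographic quality of its randomness.

    using System;
    using System.Security.Cryptography;

    class TokenDemo
    {
        static void Main()
        {
            // The approach described above: a 128-bit value rendered as hex digits and dashes.
            string guidToken = Guid.NewGuid().ToString();
            Console.WriteLine(guidToken);

            // My addition: 16 bytes from a cryptographically secure random number generator,
            // which is the safer choice if the value is going to act as a password.
            byte[] bytes = new byte[16];
            using (var rng = RandomNumberGenerator.Create())
                rng.GetBytes(bytes);
            Console.WriteLine(Convert.ToBase64String(bytes));
        }
    }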

Now, you don't just store a list of users' passwords in your database. Seriously guys, don't do that, regardless of what systems you use. Hackers will steal your database and have access to all the accounts. Instead, websites store cryptographic hashes of passwords in their database. Hashing is the process of converting some text into a number, and a cryptographic hash function is one-way--it can convert text to hashes, but there's no practical way to convert hashes back to text, so if the hash is stolen the thief still won't be able to sign in to any of the accounts, because he'd still need the unhashed versions of the passwords to do so. But that's ok for login purposes, because a good hash function is also effectively collision-free: it's computationally infeasible to find two different passwords with the same hash, so if two hashes match, we can treat them as the same password. We can, therefore, hash the password that someone enters into the login page, then look for an entry in the database that matches that hash.
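In code, the idea looks something like this minimal sketch, using SHA-256 as the hash (and, as the next paragraph explains, real systems don't store raw hashes this way):

    using System;
    using System.Linq;
    using System.Security.Cryptography;
    using System.Text;

    class HashDemo
    {
        static void Main()
        {
            using (var sha = SHA256.Create())
            {
                // What the server stores at signup, and what it recomputes from the login form.
                byte[] stored  = sha.ComputeHash(Encoding.UTF8.GetBytes("correct horse battery staple"));
                byte[] attempt = sha.ComputeHash(Encoding.UTF8.GetBytes("correct horse battery staple"));

                // The hash can't be reversed, but matching hashes are treated as a matching password.
                Console.WriteLine(Convert.ToBase64String(stored));
                Console.WriteLine(stored.SequenceEqual(attempt)); // True
            }
        }
    }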

That is harder than it sounds. You can't just do a simple select...where lookup for the hashed input, because we don't store raw cryptographic hashes of the passwords in the database either. Seriously guys, don't store the raw hashes. With a rainbow table--a giant precomputed lookup of hashes under all the usual hashing algorithms--it's still relatively easy to recover the original passwords from raw hashes and gain access to the accounts. No, instead we only hash salted passwords. Salting is the technique of mixing a random string into each password before hashing it, so that identical passwords produce different hashes and precomputed tables offer no clues. The salt isn't secret--it's stored in plain text right next to the hash--but it means the only way to check a login attempt against a stored entry is to re-hash the attempt with that entry's salt and compare the results. Without a username to pick out one row, the only way to locate a matching password is therefore to re-hash the user's input against every password in the database, one salt at a time, until a match is found. It's worse than that, because we can't safely reject a database entry the instant a comparison fails: attackers can measure how long the rejection takes to gain information about the passwords stored in the database, so each comparison should run in constant time. We're left working through every single part of every single password--about the most time-intensive method possible.
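Here's a minimal sketch of what that scan looks like, assuming PBKDF2 (Rfc2898DeriveBytes, on a reasonably recent .NET) as the salted hash; the class and method names are hypothetical. With a username, the same derive-and-compare check would run against one row picked out by an index; without one, every row has to be tried.

    using System;
    using System.Collections.Generic;
    using System.Security.Cryptography;

    class StoredCredential
    {
        public byte[] Salt;
        public byte[] Hash;
    }

    static class PasswordStore
    {
        const int Iterations = 100000;
        const int HashBytes = 32;

        // At signup: generate a random salt and store it next to the salted hash.
        public static StoredCredential Create(string password)
        {
            byte[] salt = new byte[16];
            using (var rng = RandomNumberGenerator.Create())
                rng.GetBytes(salt);
            return new StoredCredential { Salt = salt, Hash = Derive(password, salt) };
        }

        // At login, with no username: re-hash the candidate with every row's salt.
        public static bool AnyMatch(string candidate, IEnumerable<StoredCredential> table)
        {
            bool found = false;
            foreach (var row in table)
            {
                byte[] hash = Derive(candidate, row.Salt);
                // Constant-time comparison, so timing doesn't leak how close a row came to matching.
                found |= FixedTimeEquals(hash, row.Hash);
            }
            return found;
        }

        static byte[] Derive(string password, byte[] salt)
        {
            using (var kdf = new Rfc2898DeriveBytes(password, salt, Iterations, HashAlgorithmName.SHA256))
                return kdf.GetBytes(HashBytes);
        }

        static bool FixedTimeEquals(byte[] a, byte[] b)
        {
            if (a.Length != b.Length) return false;
            int diff = 0;
            for (int i = 0; i < a.Length; i++)
                diff |= a[i] ^ b[i];
            return diff == 0;
        }
    }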

It turns out that even if we used globally unique and random passwords, salt makes a password-only login system fairly unworkable. Usernames actually provide two functions: They are globally unique identifiers, and they are also unsalted keys for fast database lookup.

I'm left to wonder why we treat usernames as if they were public information, not only storing them unencrypted on the server, but often even publishing them to all users on the site. Yet because of their second, less obvious function of providing fast database lookup, knowledge of usernames represents real power for attackers, and a real security vulnerability for users. We can't salt usernames, but we can hide them from public view and store them as hashes rather than plain text. In fact, we don't need to force users to manage their own usernames at all. All we really need is to implement password rules that require part of the user's password to be unique--perhaps just the first four characters of a 12-or-more-character password--and to store the hash of that portion unsalted. From the user's perspective, they now only need to enter one credential to access their account.
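A minimal sketch of that last idea, with hypothetical names: the unsalted hash of the first four characters acts like a username--it can be indexed for a fast lookup--while the full password would still be checked against a salted hash as in the earlier sketch.

    using System;
    using System.Security.Cryptography;
    using System.Text;

    static class PrefixKey
    {
        // Assumes the site enforces passwords of 12 or more characters whose first
        // four characters are unique across accounts, as proposed above.
        public static string LookupKey(string password)
        {
            string prefix = password.Substring(0, 4);
            using (var sha = SHA256.Create())
            {
                byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(prefix));
                return BitConverter.ToString(digest); // stored unsalted, so the column can be indexed
            }
        }
    }

Login then becomes: hash the first four characters of whatever the user typed, select the row with that lookup key, and verify the full password against that row's salt and hash.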


1 Sure, in theory the random number generator could repeat itself--maybe the Imperator Intergalacticus a million years from now will need to promulgate an alternative IGuid method to ensure inter-galactically unique identification across Earth's sprawling billion-galaxy empire, but until then the probability of repeats is zero.

5/18/2015 08:01:00 PM
As happens on the first Friday of every month, the BLS released new jobs data, sending Twitter abuzz. Most news outlets reported that, according to the BLS's initial estimate, the economy added a net 223,000 jobs in April compared to March of this year. 223,000 is not a lot, but not that bad either. And most journalists will leave it at that, never following up.

But skilled journalists understand that these numbers are soft, and report not just the headline estimate, but also revisions to previous months' estimates. From the BLS press release:
"The change in total nonfarm payroll employment for February was revised from +264,000 to +266,000, and the change for March was revised from +126,000 to +85,000. With these revisions, employment gains in February and March combined were 39,000 lower than previously reported."
Generally, in addition to releasing the initial estimate for the previous month, the BLS also revises the estimates for the two months preceding that, so each month's estimate is eventually revised twice. How big are these revisions?

I did some data gathering through ALFRED. I assembled a dataset with both the first estimates and final revisions for all the months, and compared the two:
Histogram of the amount by which the BLS initially underestimated employment growth.
The median error was -6,500 workers, with an interquartile range of -157,500 to 171,000 workers. Note that this is the final revision minus the initial estimate, so the BLS is slightly optimistic about US employment growth on average, overestimating it 52 percent of the time.
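For what it's worth, the comparison itself is simple. Here's a minimal sketch of the calculation, using only the February and March figures quoted above as stand-ins for the full ALFRED series:

    using System;
    using System.Linq;

    class RevisionErrors
    {
        static void Main()
        {
            // Stand-in data (thousands of jobs): the two months quoted from the press release.
            // The real exercise pairs every month's first estimate with its final revision.
            double[] initial = { 264, 126 };
            double[] revised = { 266, 85 };

            // Error = revision minus initial estimate, matching the convention above.
            double[] errors = initial.Zip(revised, (i, r) => r - i).OrderBy(e => e).ToArray();

            Console.WriteLine("Median error: {0:F1}k", Quantile(errors, 0.50));
            Console.WriteLine("IQR: {0:F1}k to {1:F1}k", Quantile(errors, 0.25), Quantile(errors, 0.75));
            Console.WriteLine("Share of months overestimated: {0:P0}",
                errors.Count(e => e < 0) / (double)errors.Length);
        }

        // Quantile by linear interpolation on a sorted array.
        static double Quantile(double[] sorted, double p)
        {
            double pos = p * (sorted.Length - 1);
            int lo = (int)Math.Floor(pos);
            int hi = (int)Math.Ceiling(pos);
            return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
        }
    }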

The most relevant question, however, is whether the initial estimate is a helpful policy guide in real time. To test that, I compared the BLS's initial estimates to an automated ARIMA model, optimized by the Bayesian information criterion, that uses only the historical data that would have been available at the time to forecast three months ahead. Thus the forecast model uses only the final revisions available at the time to estimate each month's jobs figure. The resulting median error was -10,730, with an interquartile range of -147,300 to 126,300.
Amount by which automated ARIMA forecast underestimates employment.
As with the BLS's initial estimate, the automated ARIMA forecast slightly overestimates employment growth most of the time, with the forecast exceeding the final numbers about 52 percent of the time.

These distributions are basically identical. Sure, you can spot slight differences in a couple of the moments, and sure, the BLS data is ever-so-slightly more accurate on average, but by and large these estimates are pretty much the same. So, in the end, the famed jobs-day numbers that cause such a stir on Twitter the first Friday of every month probably don't tell us anything we didn't already know.

I didn't have time to examine the numbers in more detail. Is there more of a difference between the two methods at critical points in time such as when we are heading into recession? I assign these kinds of questions to the reader as homework. You can get my data here, my R script here, and this is what I used to extract historical initial estimates and match them to eventual revisions (program written in C#, requires the EPPlus library here).
5/08/2015 01:43:00 PM
That said, for the marginal family on the verge of moving, this looks about right.
A fun and interesting post/app about economic mobility at the NYT Upshot blew up Twitter a couple of days ago. At least, it blew up my corner of Twitter, which consists mostly of PhD economists tweeting about how important this research is. The post was about a paper by Raj Chetty and Nathaniel Hendren, "The Impacts of Neighborhoods on Intergenerational Mobility: Childhood Exposure Effects and County-Level Estimates," which used a unique identification strategy to tease out the causal effect of moving to each county in the US on children's outcomes.

I'm going to make a bold claim: the paper is actually not that important to the general public. It's certainly interesting from an academic perspective, but I'll argue that it has very limited policy implications and that the research literature isn't really ready for the popular press.

Chetty and Hendren obtained estimates of the effects various counties have on kids' outcomes. The interpretation the NYT Upshot piece drew from this is that if you move to a county like, say, Warren County, Ohio, your kids will end up earning $2,500 more per year on average, whereas moving to Hamilton County, Ohio, would cause them to earn $800 less. What policy conclusions can we draw from this? It's unclear. The Upshot also noted that previous research found that policies that actually paid poor people to move to nicer neighborhoods failed to achieve the desired result, so the policy implication would seem to be not to encourage people to move from "bad" counties to "good" ones. Even ignoring that previous negative result, there would still be tons of work to do before we could begin proposing such a plan--moving people from county to county involves substantially higher costs and more serious consequences than merely paying for apartments in nicer buildings, as the previous programs did. More realistically, this paper is merely a continuation of a line of research, and Chetty and Hendren conclude with their recommendation for the next step:
"We hope these data facilitate future work exploring the mechanisms through which neighborhoods have causal effects on intergenerational mobility."
That's a long way off from any actionable policy advice. More realistically, what Chetty and Hendren have done is construct an index of county effects on child outcomes, which can then be used in future papers as an instrument to identify things that improve outcomes for poor children. As of right now, we really don't have a good idea of what causes the effects Chetty and Hendren measured, or even whether they measured those effects correctly. Hence, the research literature really isn't ripe for public consumption.1

So why were all the Twitter economists tweeting about how "important" this paper is? That remains unclear to me. As Brian Albrecht pointed out, it's not as if the paper upsets old paradigms. No one really doubted that location matters. What Chetty and Hendren did--something Chetty is freakishly good at--is find a novel identification strategy for teasing out causal effects from the overall correlation between moving and child outcomes. The mere correlation could have been driven by endogeneity bias, but Chetty and Hendren were able to extract a component that we can be reasonably confident is causal.

But "causal" is a nebulous construct. Let's examine their identification strategy. They have a dataset in which families have moved from county to county, and they have data on the child outcomes in these various counties. The problem with simply regressing the one on the other is that people don't just move at random, they move at least in part because of how good they think these counties will be for their children. What Chetty and Hendren did was look instead at the ages of the kids when they moved, and--taken literally--found that the younger the kids are when they moved, the bigger the correlation is between county and their outcomes. Intuitively, if a family has two kids of different ages when they move to a "good" county, the younger child does better later in life than the older child. If you are willing to assume that the reasons for moving to these counties, by and large, is unrelated to the ages of their children, then we can interpret this correlation as the causal effect of the county on child outcomes. If we were just observing upwardly mobile families moving to these counties, instead of the counties making them upwardly mobile, then the correlation should be the same for both the older and younger child. So the estimate is "causal." But what is causing what for whom?

This is far from a clear result, and the results themselves were often very strange: it's quite possible that Chetty and Hendren have a ceteris paribus problem with their data. The big secret with these causal inference models is that they are all about 10 percent statistics and 90 percent picking the right control group. By using younger and older siblings in families that moved, the paper tries to mimic an RCT in which the treatment group is families who moved and the control group is families with identical income who moved to an average county instead (Update: something didn't quite make sense here; ht Salim Furth for the edit). But as Salim Furth noted, there's no particular reason to suppose that when you move from a bad county to a good county you'll be able to keep the same income.

Chetty and Hendren needed to include income as a control variable because otherwise they'd have omitted variable bias: income is correlated both with the reasons people move and with child outcomes. Methodologically, this was probably the best approach. But as I've written previously, control variables are a sign of weakness, and can reintroduce endogeneity even into otherwise "clean" identification strategies. In other words, if moving to a good county also causes family income to decline, then it doesn't really make sense to say that the county "causes" better child outcomes--yes, that causal mechanism may be there, but it does not exist in isolation.

Even ignoring the issues of endogeneity, Furth's critique can be thought of as a question of generality: Chetty and Hendren's dataset includes only families that voluntarily chose to move under the status quo. The fact that they chose to move suggests that they had something to gain from the move--relatives, say, or a lucrative job opportunity--things that would benefit the kids. Take those away, and it's not clear that moving actually does benefit the kids. I mean, surely there exists someone who is better off living in Hamilton County than in Warren County.

So when it comes to the actual policy question that readers think the Upshot piece was about--should they move to a better county?--the answer remains, at best, "it depends," and the truth is that Chetty and Hendren don't shed much light on whether you are one of the families who would benefit from moving.


1. What I mean to say here is that the research question remains far from answered. I don't object to popular-press readers following along with the research process, but it is unfortunate that the Upshot presented this as a settled question when in fact we are only at an intermediate stage.
5/06/2015 09:49:00 AM