Tuesday, December 16, 2014

How to use Google reCaptcha

I recently added a link so readers can email me. However, to prevent spam, I made the email link a web form protected by Google's reCaptcha system, which they recently revamped to be (allegedly) easier for humans and harder for spambots. It now looks like this:
Now you just have to check the box, and it may or may not prompt you with a challenge depending on whether Google thinks you might be a robot.
The system really is pretty easy to use, but even so I found a frustrating lack of up-to-date information on how to do this in the ASP.NET framework with C#. All of the forums I found were either far more complicated than they needed to be, out of date, or did not fully explain how to do it. So here's my attempt to fill that gap:

To start, you need an aspx page containing your web form. I'll assume that you already know the basics of how to make an aspx web form (in Visual Studio, most likely), set up event handlers, do postbacks, etc. from the code-behind. I won't assume anything more than a beginner's understanding of these, but if none of that sounds familiar, you need to start with an ASP.NET tutorial instead of what follows. Also, I'm doing all of this in Visual Studio 2010 with C#. So our aspx page code looks like this:
A basic webform with three server controls: a label that says "label:", a text box where users can enter in some text, and a submit button.
The idea with this form is that users will type something in (an email message, for example), click submit, and then you can do something with that user input on the server side (send it as an email to someone, perhaps, or maybe save the text into a database).

The problem is that this form is accessible to both humans (good) and spambots (bad), so we need to add a reCaptcha to prevent spammers from being able to programmatically use these controls. To do that, you first need to go to Google to get set up with reCaptcha. It is a free service, but you need to create an account and get three things: a script tag that looks like this:
<script src="https://www.google.com/recaptcha/api.js" async defer></script>
a div that looks like this:
<div class="g-recaptcha" data-sitekey="your_site_key"></div>
(but with your site key, a public identifier, in place of your_site_key), and a secret key, which you will need for the server-side code (and which, unlike the site key, should never appear in your page markup). We insert the script tag into the head of our aspx page and the div into the form like so:
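To make that concrete, here is a minimal sketch of what the page markup might look like at this point. The control IDs, title, and page-directive values are assumptions chosen to match the code-behind used in this post (namespace emailApp, class WebForm1, handler myFunction); the essential additions are the script tag in the head and the g-recaptcha div inside the form:

```aspx
<%@ Page Language="C#" AutoEventWireup="true" CodeBehind="WebForm1.aspx.cs" Inherits="emailApp.WebForm1" %>
<!DOCTYPE html>
<html>
<head runat="server">
    <title>Email form</title>
    <script src="https://www.google.com/recaptcha/api.js" async defer></script>
</head>
<body>
    <form id="form1" runat="server">
        <asp:Label ID="Label1" runat="server" Text="Label:" />
        <asp:TextBox ID="TextBox1" runat="server" TextMode="MultiLine" />
        <div class="g-recaptcha" data-sitekey="your_site_key" runat="server"></div>
        <asp:Button ID="Button1" runat="server" Text="Submit" OnClick="myFunction" />
    </form>
</body>
</html>
```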
Note: Visual Studio may complain about the div above, which has an attribute the server won't recognize. Ignore it--the user's browser will know what to do.
It's important to note that I've added an OnClick event handler to the Submit button, which calls the function myFunction that we will be adding to the code-behind file shortly. Also note that I set the reCaptcha div to runat="server". This is what our aspx page will look like to users:
An aspx form with a reCaptcha
Now we need to head to the code-behind file.

The code-behind has a Page_Load event handler by default. We won't be using it. Below it, we'll add three functions: one is the myFunction that gets called when the Submit button is clicked, and the other two will get the user's IP address and check whether the reCaptcha validated, respectively. Additionally, we will add three using statements to the top: two are for System.Net and System.IO, which are part of the standard library, and the third is Recaptcha, which is not. You'll need to download the Recaptcha library here, extract it from the zip file, and add a reference to the library from your IDE, which is different from just adding the using statement (in Visual Studio, in the Solution Explorer right-click References, then click Add Reference, go to the Browse tab, and point it to the location of the file you just downloaded). So we have a code-behind skeleton that looks like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using Recaptcha;
using System.Net;
using System.IO;

namespace emailApp
{
    public partial class WebForm1 : System.Web.UI.Page
    {
        protected void Page_Load(object sender, EventArgs e)
        {

        }

        bool reCaptchaValidate(string ipAddress)
        {
            bool Valid = false;
            return Valid;
        }

        string getIpAddress()
        {
            return null; // placeholder; we fill this in below
        }

        protected void myFunction(object sender, EventArgs e)
        {

        }
    }
}
Here's the guts of the method to get the user's IP address:
        string getIpAddress()
        {
            System.Web.HttpContext context = System.Web.HttpContext.Current;
            string ipAddress = context.Request.ServerVariables["HTTP_X_FORWARDED_FOR"];

            if (!string.IsNullOrEmpty(ipAddress))
            {
                string[] addresses = ipAddress.Split(',');
                if (addresses.Length != 0)
                {
                    return addresses[0];
                }
            }

            return context.Request.ServerVariables["REMOTE_ADDR"];
        }
You needn't worry too much about how this works; it is just a generic method that attempts to grab the user's IP address and return it as a string. It's a bit messy because it isn't, in general, possible to get the user's IP address if, for example, the user is behind a proxy server. Proxy servers sometimes forward the original IP address, so this method looks for that header if it exists, but it won't always work and that's OK. ReCaptcha uses the IP address as one of its criteria to determine whether someone is a bot, but your app will still work fine even when we can't obtain the correct IP for some users.

Next, we fill in the method that will check the reCaptcha to see if the user passed the test. It is as follows:
        public bool reCaptchaValidate(string ipAddress)
        {
            bool Valid = false;
            // The response token the reCaptcha widget posted back with the form
            string Response = Request["g-recaptcha-response"];
            string url = @"https://www.google.com/recaptcha/api/siteverify?secret=your_secret_key&response=" + Response + @"&remoteip=" + ipAddress;
            // Request to Google's verification server
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
            try
            {
                // Google reCaptcha response
                using (WebResponse wResponse = req.GetResponse())
                {
                    using (StreamReader readStream = new StreamReader(wResponse.GetResponseStream()))
                    {
                        string jsonResponse = readStream.ReadToEnd();
                        // Fragile but simple: relies on the exact formatting of the JSON reply
                        if (jsonResponse.Substring(15, 4) == "true")
                        {
                            Valid = true;
                        }
                    }
                }
            }
            catch (WebException)
            {
                // If the verification request fails, treat the user as unverified
                Valid = false;
            }
            return Valid;
        }
OK, there's some stuff going on in there. The first thing that happens is we get the user's response to the reCaptcha from Request["g-recaptcha-response"], which reads the g-recaptcha-response field that the reCaptcha widget posts back with the form. The next thing to note is the URL. We are performing a standard HTTP GET request, and the way that works is we send out a URL loaded with data, which finds its way to the server, which then sends back a response based on the data we included in the URL (side note: the + signs concatenate strings, which is how we insert variables into other strings). You'll notice a question mark in the middle of the URL above--everything before the ? is the address of the server we want the response from, which is Google in our case, and everything after it specifies parameters Google will use to determine what response to send. There are three parameters: the first is the secret key Google gave you earlier when you signed up for reCaptcha--enter that into the URL string. The second is the user's reCaptcha response, which we've already spliced into the URL, and the third is the IP address we pass into this function.
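Since the siteverify call is just an HTTP GET, the same request can be built in any language. Here's the structure sketched in Python (the key, token, and IP values are placeholders):

```python
from urllib.parse import urlencode

def build_siteverify_url(secret, response_token, remote_ip):
    """Build the URL for Google's siteverify endpoint.

    Everything before the '?' addresses the server; everything
    after it is the query string of key=value parameters.
    """
    params = urlencode({
        "secret": secret,            # your secret key from Google
        "response": response_token,  # the g-recaptcha-response value
        "remoteip": remote_ip,       # optional: the user's IP address
    })
    return "https://www.google.com/recaptcha/api/siteverify?" + params

url = build_siteverify_url("your_secret_key", "token123", "203.0.113.7")
```

Using urlencode rather than raw string concatenation also takes care of escaping any characters in the parameters that aren't URL-safe.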

The next statement after the URL preps an HTTP request with the URL we've specified, and GetResponse() fires off our request and returns the server's reply as a WebResponse object. We use the StreamReader to read the reply out as a string. The using() blocks aren't strictly necessary--they just ensure the response and stream get disposed--and the try/catch keeps a failed web request from crashing the page; how much of that matters depends on how your application works. The response string is formatted as a JSON object, like so:
{
  "success": true|false,
  "error-codes": [...]   // optional
}
But all we want is whether success: is followed by true or false, so we extract that using the Substring() method and test whether the result is equal to "true". If it is, then the user is probably not a bot so we set Valid=true which will allow the rest of the application to execute. Otherwise, we will assume it's a bot, and refuse to execute the rest of the code.
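Incidentally, the Substring(15, 4) trick is fragile: it depends on the exact whitespace in Google's reply. A real JSON parser is more robust. Here's the idea sketched in Python (in C# a JSON library such as Json.NET would do the same job):

```python
import json

def recaptcha_ok(json_reply: str) -> bool:
    """Parse the siteverify reply and return its 'success' field.

    Unlike a fixed Substring offset, this tolerates whitespace
    changes and extra fields like 'error-codes'.
    """
    try:
        return json.loads(json_reply).get("success", False) is True
    except ValueError:
        return False  # malformed reply: treat the user as unverified

# Both of these parse correctly, while a fixed offset would not:
assert recaptcha_ok('{"success": true}')
assert not recaptcha_ok('{ "success" : false, "error-codes": ["timeout-or-duplicate"] }')
```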

But so far, our web form does nothing. That's because we've not put anything into the myFunction method, and thus neither of the methods above is being called. So here's how we do that:
protected void myFunction(object sender, EventArgs e)
        {
            if (reCaptchaValidate(getIpAddress()))
            {
                string userInput=Server.HtmlEncode(TextBox1.Text);
                //do some stuff with userInput
            }

        }
myFunction will be called when the user clicks the Submit button. In the if() statement, we call getIpAddress() and pass the returned value as the input parameter to reCaptchaValidate(); if that returns true, the code inside the curly brackets executes. So far, the only thing happening in there is that we grab the user's text from TextBox1 and scrub it with HtmlEncode, which converts special characters like angle brackets into HTML "entities"--a useful step in making sure users cannot inject malicious code through your input box. After that, you can do whatever you want with the data. Anything inside the if statement will execute only if the user passes the reCaptcha test.
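HTML-encoding works the same way in any web stack; as an illustration, Python's html.escape performs the equivalent conversion of special characters into entities:

```python
import html

user_input = '<script>alert("hi")</script>'
safe = html.escape(user_input)  # angle brackets and quotes become entities
print(safe)  # &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

The encoded string renders as literal text in the browser instead of being interpreted as markup, which is the whole point of the scrubbing step.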

Monday, December 15, 2014

Even if you don't qualify for ACA subsidies, you probably still benefit from them

I explained why over at The Incidental Economist:
"with these healthier (less costly) individuals dropping out of the market, adverse selection leads to inefficiently low levels of insurance and higher insurance premiums.

A subsidy like the ones offered on the ACA exchanges shifts the demand curve up, so that the premium at which individuals will buy insurance equals their willingness-to-pay plus the subsidy amount. This, in turn, shifts the intersection between demand and AC to the right, in the direction of more coverage and lower average cost, lowering premiums for everyone.

The classical model where subsidies increase prices relied on having an upward-sloping marginal cost curve. The forces that cause upward-sloping MC curves in most markets didn’t totally disappear for health insurance markets–we can imagine that loading costs generally increase as more people buy insurance–but they are dwarfed by the size and heterogeneity in actual healthcare expenses, producing an overall downward-sloping MC curve. Thus, when in the presence of adverse selection, subsidies not only reduce the cost of insurance to those who receive them, but to everyone in the market."
You can find the whole thing here. To illustrate the theory, I also have a (very simple) algebraic model of adverse selection here.

Monday Morning Music

Tuesday, December 9, 2014

Does ideology influence economists' work?

Sure. There are ways in which we cannot help but let ideological considerations affect our work. A researcher must, for example, pick a research question to study, and the questions that sound interesting and important sound that way because of the researcher's ideological priors. And many of the things economists are asked to do--like, say, derive an optimal tax policy--cannot be completely separated from subjective ideological determinations. What welfare function do we optimize the tax code against? What criteria define the "optimal" condition? Despite a century of effort, we've not found a truly scientific way to answer these questions--individual-level choices do not aggregate in ways that let us use the standard revealed-preference framework to evaluate them.

When you add to that a certain amount of publication bias--not only will liberals study liberal policies and conservatives study conservative policies, but both will only publish those with statistically significant effects--it would be pretty surprising if published results didn't vary depending on the ideologies of authors. Yet, as Kevin Drum points out, that's essentially what we see:
Scatterplot of study findings versus constructed index measure of researcher's ideology.
Now, that's not what the authors of the graph above claim: they presented this graph as evidence of pervasive ideological bias in economists' work. I don't see it. Noah Smith says this is all about the R-squared, not the coefficient, and he's absolutely right.

Let me illustrate that point with a different graph. Does this regression line still look like it fits that data?
The same graph as above, with two datapoints deleted.
It does not. In fact, it looks like the regression line should actually be downward sloping! Obviously, this is just the same graph as above; the only thing I've done is remove two data points. They were the two most extreme observations in the graph--the highest and the lowest of the reported normalized elasticities.

A result that depends on two out of many data points is not a result. My guess is that the p-value was small enough to be statistically significant only because least-squares estimation gives disproportionately greater weight to extreme values than to those in the middle of the pack, and those extreme values happened to fall pretty far out on their measure of ideology (which, by the way, correctly predicts authors' ideology less than 75 percent of the time anyway). The term statisticians use is "robust"--their estimate is not robust, even if the p-value was small. I don't have the raw data to confirm, but as far as the non-outlier data is concerned, there does not appear to be any correlation here.
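To see how easily a couple of extreme observations can do this, here's a self-contained demonstration with made-up numbers (invented for illustration, not taken from the paper): ten points with a mild downward trend, plus two outliers that flip the sign of the least-squares slope.

```python
def ols_slope(xs, ys):
    """Least-squares slope: cov(x, y) / var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

# Ten points with a mild downward trend...
xs = list(range(10))
ys = [-0.1 * x for x in xs]

# ...plus two extreme observations, far out in both dimensions
xs_full = xs + [30, -20]
ys_full = ys + [10, -10]

print(ols_slope(xs, ys))            # about -0.1: downward without the outliers
print(ols_slope(xs_full, ys_full))  # positive: the two outliers flip the sign
```

The fitted slope on the full sample is driven almost entirely by the two extreme points, exactly because squared errors weight them so heavily.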

Side note: Brian Albrecht is right that it doesn't make any sense to claim the slope of the regression line is "small." We're correlating two constructed measures, so the magnitude of the slope has no discernible interpretation. But, the point is that an effect of this size cannot explain a noticeable fraction of the total variation in economists' results.

Friday, December 5, 2014

Some algebra on adverse selection in insurance markets

Here's an extremely simple model of adverse selection in an insurance market. I have in mind health insurance, but nothing about this example is specific to health.

There are two types of individuals in this health insurance market, indexed by [$]i\in\left\{H,L\right\}[$] where [$]H[$] denotes those with a high probability [$]p_H[$] of incurring a hospitalization cost [$]h \gt 0[$] and [$]L[$] denotes those with a low probability [$]p_L[$] of hospitalization. We'll assume there are equal numbers of each, and represent them as two representative individuals. Let [$]C_i[$] denote individual [$]i[$]'s consumption of all non-hospital goods and services, and let [$]y[$] be income, assumed the same for both types. Each individual seeks to maximize expected utility from non-medical consumption according to the risk-averse utility function [$$]U_i=E_i\left[ln\left(C_i\right)\right][$$] subject to the budget constraint, which is [$]C_i\leq y-h[$] if they get sick without insurance, or [$]C_i\leq y[$] if they do not buy insurance and don't get sick, or [$]C_i\leq y-m_i[$] if they buy insurance at premium [$]m_i[$] whether or not they get sick. This means expected utility without insurance is [$$]\left(1-p_i\right) ln\left(y\right)+p_i ln\left(y-h\right)[$$] versus [$$]ln\left(y-m_i\right)[$$] if he does buy insurance. The insurer is perfectly competitive with no loading costs, so the premium is equal to the insurer's expected cost for each plan.

It is Pareto Efficient for both individuals to buy insurance. To see that result, set prices [$]m_H=p_Hh[$] and [$]m_L=p_Lh[$], and note that \begin{align*} \left(1-p_H\right) ln\left(y\right)+p_H ln\left(y-h\right)&\leq ln\left(y-m_H\right) \\ \left(1-p_L\right) ln\left(y\right)+p_L ln\left(y-h\right)&\leq ln\left(y-m_L\right) \end{align*} with strict inequality guaranteed by the risk-aversion of each individual (i.e., ln is a concave function). Thus, there exist prices at which the insurer's profit-maximization conditions are satisfied and both individuals are better off with insurance than without. Not only is this Pareto Efficient, but it is also the competitive equilibrium whenever the risk types [$]p_L,p_H[$] of every individual are publicly known and insurers are allowed to discriminate on the basis of that information (aka risk-rating).
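The strict inequality is just Jensen's inequality applied to the concave ln function. A quick numerical check (with arbitrary illustrative values of income and hospitalization cost) confirms that at an actuarially fair premium, expected utility with insurance exceeds expected utility without:

```python
from math import log

y, h = 100.0, 50.0  # income and hospitalization cost (illustrative values)

for p in (0.1, 0.2, 0.5):            # hospitalization probabilities
    m = p * h                        # actuarially fair premium
    eu_no_ins = (1 - p) * log(y) + p * log(y - h)
    eu_ins = log(y - m)
    assert eu_ins > eu_no_ins        # risk-averse agent strictly prefers insurance
print("fair insurance preferred at every risk level checked")
```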

Suppose that the individuals know their own risk types but that the insurer either doesn't know which is which (asymmetric information) or is prohibited from discriminating on the basis of risk type (community rating). Either way, the insurer is forced to offer the same premium to both individuals, and the same profit-maximization condition--that the premium equals the expected cost of the plan--still applies. In this case, there are two possible types of equilibrium.

Pooling Equilibrium

Both individuals will buy insurance at the price [$]m\equiv m_H=m_L=\frac{p_L+p_H}{2}h[$] if \begin{align} \left(1-p_H\right) ln\left(y\right)+p_H ln\left(y-h\right)&\leq ln\left(y-\frac{p_L+p_H}{2}h\right)\text{, and} \label{htype}\\ \left(1-p_L\right) ln\left(y\right)+p_L ln\left(y-h\right)&\leq ln\left(y-\frac{p_L+p_H}{2}h\right) \end{align} and you can verify that the insurer's profit conditions are also satisfied. Note that the right-hand sides are the same, and the left-hand side is bigger for the [$]L[$] type, so this says we end up at a pooling equilibrium if [$]p_H[$] is not too much larger than [$]p_L[$], thus achieving an equilibrium that is also Pareto Efficient.

Separating Equilibrium

Equation \eqref{htype} is always true because the pooled premium is less than the [$]H[$] type's expected cost, so this result follows directly from the definition of risk-aversion. However, for [$]p_H[$] too much larger than [$]p_L[$] we could have [$$]\left(1-p_L\right) ln\left(y\right)+p_L ln\left(y-h\right)\gt ln\left(y-\frac{p_L+p_H}{2}h\right)[$$] because the pooled premium is higher than the [$]L[$] type's expected cost, and there is a limit to how large a risk premium individuals are willing to pay above their expected costs. If this happens, then we end up in an equilibrium where [$]m=p_Hh[$] and only the [$]H[$] type buys insurance. This equilibrium is not Pareto Efficient because the social cost of insuring the [$]L[$] type is less than what he's willing to pay, and yet the equilibrium fails to give him insurance.

Numerical Example

To prove that the above is non-vacuous, consider a numerical example: let [$]y=100[$], [$]h=50[$], and [$]p_L=0.1[$]. The graph below shows how the utilities for each with and without insurance changes as [$]p_H[$] (plotted on x-axis) changes:
The horizontal axis gives values of [$]p_H[$] while the y-axis plots the utility as given by the functions above.
If both individuals buy insurance, then their utilities will be the same, given by the red line in the graph. The [$]L[$] type's utility if he doesn't buy insurance is the blue line, which is horizontal because without insurance his utility does not depend on the [$]H[$] type's risk. Notice that for [$]p_H \gt 0.17[$] the blue line is higher than the red line, implying that the [$]L[$] type is better off not having insurance than paying the high pooled risk-premium. However, the [$]H[$] type's utility without insurance, denoted by the green line, is everywhere less than his utility in the pooling equilibrium.

Thus, if [$]p_H \gt 0.17[$] this market has adverse selection resulting in inefficiently low coverage rates. One solution for this is to subsidize. Suppose [$]p_L=0.1[$] and [$]p_H=0.2[$], then without any intervention we have adverse selection as the low risk type drops out. Importantly, notice that the insurance premium for those who continue to buy insurance, [$]m=10[$] is higher than if the low risk type had not dropped out, which would have been [$]m=7.5[$]. It does not take much of a subsidy here to induce the low type to buy: a subsidy of just 0.81 is more than enough, and in fact is also low enough that the [$]H[$] type would still be better off even if he has to pay the taxes to fund the subsidies--a cost to him of 8.31 instead of 10. Hence, a tax-and-subsidy scheme to induce everyone to buy insurance does make every person individually better off in the presence of adverse selection. With the subsidies, the market is once again Pareto Efficient.
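These numbers are easy to verify; here's a short Python sketch that reproduces the 0.17 threshold and the minimum subsidy of roughly 0.8 (so 0.81 is indeed enough):

```python
from math import exp, log

y, h, p_L = 100.0, 50.0, 0.1

# The L type's expected utility without insurance
eu_L_no_ins = (1 - p_L) * log(y) + p_L * log(y - h)

# The L type is indifferent when log(y - m) = eu_L_no_ins, i.e. at the
# premium m_star = y - exp(eu_L_no_ins); with the pooled premium
# m = (p_L + p_H)/2 * h, that pins down the threshold value of p_H
m_star = y - exp(eu_L_no_ins)
p_H_star = 2 * m_star / h - p_L
print(round(p_H_star, 3))  # ~0.168, the 0.17 threshold in the graph

# With p_H = 0.2 the pooled premium is 7.5; a subsidy s keeps the L type
# in the market when log(y - 7.5 + s) >= eu_L_no_ins
s_min = 7.5 - m_star
print(round(s_min, 2))  # ~0.8, so a subsidy of 0.81 is indeed enough
```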