There are many problems in economics where a randomized control group cannot be constructed, or where a seemingly random assignment is fraught with bias introduced by the researcher. A famous example is the polling for the 1948 Truman vs. Dewey presidential election conducted by Gallup. The polls showed Dewey winning comfortably, and the statisticians at Gallup treated his victory as virtually guaranteed. Truman won instead. What caused such a huge miss? As the story goes, the pollsters, well versed in random sampling techniques, had started using the telephone to call voters at random and gather the sample they required for hypothesis testing. The problem was that in 1948 there was still a high correlation between income and telephone ownership. Wealthier households, which leaned toward Dewey, were far more likely to own a telephone, so the sample was heavily biased in Dewey’s favor, and the voters who actually turned out did not look like the telephone owners. Random sampling is crucial to hypothesis testing, but even under the most apparently random assignment one must be wary that some kind of selection bias has crept in, unbeknownst to the researcher.

To deal with this kind of estimation bias, James Heckman of the University of Chicago devised an ingenious two-stage estimation method that corrects for a known selection bias. This work was recognized in his 2000 Nobel Prize. The objective of this post is to explain the Heckman error correction model in the context of a labor economics problem that plagued researchers until the method was formulated. Before the econometric estimates, the background of the problem and the theoretical construct of Heckman’s model are introduced.

Background on the Selection Problem in the Labor Market

Understanding how education and experience are related to wages is complicated by non-random selection in the labor market. People work because the wages they are offered are greater than what economists call their reservation wage. The reservation wage is the lowest wage at which a person is willing to work; if offered wages fall below this amount, the person leaves the labor market. Researchers therefore observe wages only for people whose offered wage exceeds their reservation wage, and this can introduce non-random selection bias. Since education and experience are related to the wages people are offered, the people selected into the labor market have more education and experience than the population as a whole. People with less experience and education make up a larger share of those who are unemployed or completely out of the labor force. This causes problems when trying to estimate the impact of education and experience on wages: in theory, the estimated effects of education and experience on wages would be biased upward. This assertion will be tested in the empirical section that follows the theory.
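
The selection mechanism described above can be illustrated with a small simulation. This is only a sketch under invented assumptions: the wage equation, the reservation wage distribution, and every number in it are hypothetical and are not taken from any data set discussed in this post. It simply shows that when people work only if their offered wage exceeds their reservation wage, the workers we observe are more educated on average than the population.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: all parameters below are illustrative assumptions.
people = []
for _ in range(50_000):
    educ = random.gauss(12, 2)                        # years of education
    offered = 0.10 * educ + random.gauss(0, 0.5)      # log offered wage
    reservation = 1.2 + random.gauss(0, 0.5)          # log reservation wage
    works = offered > reservation                     # wage observed only if True
    people.append((educ, offered, works))

mean_educ_all = statistics.fmean(e for e, _, _ in people)
mean_educ_workers = statistics.fmean(e for e, _, w in people if w)
share_working = sum(1 for _, _, w in people if w) / len(people)

print(f"share working:              {share_working:.2f}")
print(f"mean education, population: {mean_educ_all:.2f}")
print(f"mean education, workers:    {mean_educ_workers:.2f}")
```

The gap between the two means is the non-random selection at work: the sample with observed wages is not a random draw from the population.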

Theory of Sample Selection Bias and Heckman Error Correction Model
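
As a compact reference, the standard textbook form of the selection model can be sketched as follows. The notation (x_i, z_i, beta, gamma, rho, sigma_u, and the selection indicator s_i) is an assumption chosen for this sketch and may differ from the notation used elsewhere.

```latex
\begin{align*}
\text{Outcome (log wage):}\quad
  & y_i = x_i\beta + u_i, \qquad y_i \text{ observed only if } s_i = 1 \\
\text{Selection:}\quad
  & s_i = \mathbf{1}\left[\, z_i\gamma + v_i > 0 \,\right] \\
\text{Errors:}\quad
  & (u_i, v_i) \text{ mean-zero bivariate normal},\quad
    \operatorname{Var}(v_i) = 1,\ \operatorname{Var}(u_i) = \sigma_u^2,\
    \operatorname{Corr}(u_i, v_i) = \rho \\[4pt]
\text{Selected-sample mean:}\quad
  & E[\, y_i \mid x_i, z_i, s_i = 1 \,] = x_i\beta + \rho\,\sigma_u\,\lambda(z_i\gamma),
    \qquad \lambda(c) = \frac{\phi(c)}{\Phi(c)}
\end{align*}
```

Here phi and Phi are the standard normal density and distribution function, and lambda is the inverse Mills ratio. The two steps follow directly from the last line. First, estimate gamma with a probit of s_i on z_i over the full sample and compute the fitted lambda for each observation. Second, run OLS of y_i on x_i and the fitted lambda over the selected sample only. The coefficient on lambda estimates rho times sigma_u, so it has the same sign as rho, and a test that it equals zero is a test for selection bias.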

Empirical Example in the Labor Market

OLS regression shows that the return to one additional year of education is approximately 10%.

Estimation with the Heckman two-step estimator shows a different picture…

• Notice the upper right-hand side of the output: 325 of the 753 women are out of the labor force.
• The “select” section is a probit estimate of the probability of observing wages, given a set of explanatory variables.
• The “lwage” section is the estimate using Heckman; education’s impact on wages has increased by almost 0.2 percentage points.
• Notice that the regressors of the “lwage” equation are a strict subset of those in the “select” equation.
• Lambda is positive, as expected; this indicates a positive correlation between the error term in the log(wage) equation and the error term in the selection equation for wage observation.
• The hypothesis that the coefficient on lambda is zero cannot be rejected (z = 0.31), hence the sample selection bias that was expected did not materialize in this data set. OLS regression would seem appropriate in this case.
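
To make the mechanics behind the “select” and “lwage” sections concrete, here is a minimal sketch of the two-step procedure on simulated data. It is not the software or the data behind the output discussed above. The variable names (`educ`, `kids`), the parameter values, and the hand-written probit routine are all assumptions made for illustration. The simulated selection equation includes one variable (`kids`) that is excluded from the wage equation, so the wage regressors are a strict subset of the selection regressors.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # elementwise error function from the standard library


def norm_pdf(c):
    return np.exp(-0.5 * c**2) / math.sqrt(2.0 * math.pi)


def norm_cdf(c):
    return 0.5 * (1.0 + _erf(c / math.sqrt(2.0)))


def fit_probit(Z, s, max_iter=50, tol=1e-10):
    """Probit maximum likelihood via Fisher scoring (illustrative, not production code)."""
    gamma = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        index = Z @ gamma
        p = np.clip(norm_cdf(index), 1e-10, 1.0 - 1e-10)
        d = norm_pdf(index)
        score = Z.T @ (d * (s - p) / (p * (1.0 - p)))
        info = (Z * (d**2 / (p * (1.0 - p)))[:, None]).T @ Z
        step = np.linalg.solve(info, score)
        gamma = gamma + step
        if np.max(np.abs(step)) < tol:
            break
    return gamma


# --- Simulated data: every number here is a hypothetical assumption ---
rng = np.random.default_rng(0)
n = 20_000
educ = rng.normal(12.0, 2.0, n)
kids = rng.integers(0, 3, n).astype(float)   # affects selection only
v = rng.standard_normal(n)                   # selection-equation error
rho, sigma_u = 0.8, 0.5
u = sigma_u * (rho * v + math.sqrt(1.0 - rho**2) * rng.standard_normal(n))

works = (-2.2 + 0.25 * educ - 0.8 * kids + v > 0).astype(float)
lwage = np.where(works == 1, 0.5 + 0.10 * educ + u, np.nan)  # unobserved if not working

X = np.column_stack([np.ones(n), educ])          # wage-equation regressors
Z = np.column_stack([np.ones(n), educ, kids])    # selection-equation regressors
sel = works == 1

# OLS on workers only, ignoring selection
ols_beta = np.linalg.lstsq(X[sel], lwage[sel], rcond=None)[0]

# Step 1: probit on the full sample, then the inverse Mills ratio (lambda)
gamma_hat = fit_probit(Z, works)
mills = norm_pdf(Z @ gamma_hat) / norm_cdf(Z @ gamma_hat)

# Step 2: OLS on workers with lambda added as a regressor
X_aug = np.column_stack([X[sel], mills[sel]])
heck_beta = np.linalg.lstsq(X_aug, lwage[sel], rcond=None)[0]

print("value used to simulate the educ coefficient: 0.10")
print(f"OLS educ coefficient:      {ols_beta[1]:.4f}")
print(f"two-step educ coefficient: {heck_beta[1]:.4f}")
print(f"coefficient on lambda:     {heck_beta[2]:.4f}  (simulated rho * sigma_u = {rho * sigma_u:.2f})")
```

The sketch reports point estimates only. The ordinary standard errors from the second-step regression are not valid, because lambda is itself estimated and the selected-sample errors are heteroskedastic. Packaged two-step routines apply a correction, which is what a reported z statistic on lambda relies on.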