
Heckman's sample selection model is based on the following two latent variable models:

*Y*_{1} = β'*X* + *U*_{1}
*Y*_{2} = γ'*Z* + *U*_{2}

The first model is the one we are interested in. However, the latent variable *Y*_{1} is only observed if *Y*_{2} > 0. Thus,
the actual dependent variable is:

*Y* = *Y*_{1} if *Y*_{2} > 0, and *Y* is missing if *Y*_{2} ≤ 0.

The latent variable *Y*_{2} itself is not observable, but only its sign:
we only know that *Y*_{2} > 0 if *Y* is observable, and
*Y*_{2} ≤ 0 if not. Consequently, we may without
loss of generality normalize *U*_{2} such that its variance is equal to 1.

As explained in detail in HECKMAN.PDF, if you ignore the sample selection problem and regress *Y* on *X* using the observed *Y*'s only, then the OLS estimator of β will be biased,
due to the fact that

E[*U*_{1} | *U*_{2} > −γ'*Z*] = ρσ·φ(γ'*Z*)/Φ(γ'*Z*),

where Φ is the cumulative distribution function of the standard normal distribution, φ is the corresponding density,
σ^{2} is the variance of *U*_{1}, and ρ is the correlation
between *U*_{1} and *U*_{2}.
Hence,

E[*Y* | *X*, *Z*, *Y*_{2} > 0] = β'*X* + ρσ·φ(γ'*Z*)/Φ(γ'*Z*).

The latter term causes the sample selection bias if ρ is non-zero.
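The bias term above is ρσ times the inverse Mills ratio λ(γ'*Z*) = φ(γ'*Z*)/Φ(γ'*Z*). As an illustrative sanity check (mine, not part of EasyReg), the conditional-mean formula can be verified by simulation in a few lines of Python, using the same error construction *U*_{1} = 0.4358899·*e*_{1} + 0.9·*e*_{2}, *U*_{2} = *e*_{2} that is used for the artificial data below:

```python
import math
import random

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mills_ratio(c):
    """Inverse Mills ratio lambda(c) = phi(c) / Phi(c)."""
    return norm_pdf(c) / norm_cdf(c)

# Monte Carlo check of E[U1 | U2 > -c] = rho * sigma * phi(c)/Phi(c),
# with U1 = 0.4358899*e1 + 0.9*e2 and U2 = e2, so rho = 0.9 and sigma = 1.
random.seed(0)
c = 0.5  # an arbitrary illustrative value of gamma'Z
draws = []
for _ in range(200_000):
    e1, e2 = random.gauss(0, 1), random.gauss(0, 1)
    if e2 > -c:                                  # selection: U2 > -c
        draws.append(0.4358899 * e1 + 0.9 * e2)  # U1 on the selected sample
simulated = sum(draws) / len(draws)
theoretical = 0.9 * 1.0 * mills_ratio(c)
print(round(simulated, 3), round(theoretical, 3))
```

The simulated conditional mean of *U*_{1} should be close to the theoretical value ρσ·λ(c) ≈ 0.458 for c = 0.5.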

To avoid the sample selection problem, and to get asymptotically efficient estimators,
one has to estimate the model parameters by maximum likelihood. In order to do so, we need to derive the conditional density
*h*(y|*X*,*Z*,β,γ,ρ,σ),
say, of *Y*_{1} given *Y*_{2} > 0, *X* and *Z*:

*h*(y|*X*,*Z*,β,γ,ρ,σ) = (1/σ)·φ((y − β'*X*)/σ)·Φ((γ'*Z* + (ρ/σ)(y − β'*X*))/√(1 − ρ^{2}))/Φ(γ'*Z*),

where Φ and φ are the same as before. See HECKMAN.PDF.

The details of the maximum likelihood estimation procedure can be found in HECKMAN.PDF.
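For readers who want to verify the conditional density *h* numerically, here is a small Python sketch (mine, not part of EasyReg or HECKMAN.PDF) that implements *h* and checks that it integrates to 1 over y, for illustrative parameter values:

```python
import math

def norm_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def h(y, bx, gz, rho, sigma):
    """Conditional density of Y1 given Y2 > 0, with bx = beta'X and gz = gamma'Z."""
    u = (y - bx) / sigma
    return (norm_pdf(u) / sigma) \
        * norm_cdf((gz + rho * u) / math.sqrt(1.0 - rho ** 2)) \
        / norm_cdf(gz)

# Sanity check: a conditional density must integrate to 1 over y.
# Illustrative values only; the tour's true parameters have rho = 0.9, sigma = 1.
bx, gz, rho, sigma = 1.0, 0.5, 0.9, 1.0
lo, hi, m = bx - 8.0 * sigma, bx + 8.0 * sigma, 4000
step = (hi - lo) / m
total = sum(0.5 * (h(lo + i * step, bx, gz, rho, sigma) +
                   h(lo + (i + 1) * step, bx, gz, rho, sigma)) * step
            for i in range(m))
print(round(total, 6))
```

The trapezoid-rule integral over [β'X − 8σ, β'X + 8σ] should be 1 up to numerical error.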

The data have been generated artificially as follows.
First, the independent variables *X*_{1,j} and *X*_{2,j} for *j* = 1,...,*n* = 500 have been generated as:

*X*_{1,j} = *v*_{1,j} + 0.5·*v*_{2,j}
*X*_{2,j} = 0.5·*v*_{1,j} + *v*_{2,j}

*Y*_{1,j} = *X*_{1,j} + *X*_{2,j} + *U*_{1,j}
*Y*_{2,j} = 2·*X*_{1,j} + 2·*X*_{2,j} + *U*_{2,j}

*U*_{1,j} = 0.4358899·*e*_{1,j} + 0.9·*e*_{2,j}
*U*_{2,j} = *e*_{2,j}

where the *v*_{i,j}'s and *e*_{i,j}'s are independent standard normally distributed random drawings, so that:

- ρ = *corr*(*U*_{1,j}, *U*_{2,j}) = 0.9
- σ^{2} = *var*(*U*_{1,j}) = *var*(*U*_{2,j}) = 1
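As an independent check of this data-generating process, the following Python sketch (not part of EasyReg) regenerates the data with a larger sample size than the tour's *n* = 500 and verifies that var(*U*_{1}) = var(*U*_{2}) = 1, corr(*U*_{1}, *U*_{2}) = 0.9, and that *Y*_{2} > 0 for about half of the observations (the tour's sample shows 48.80%):

```python
import math
import random

random.seed(12345)
n = 100_000  # larger than the tour's n = 500, to check the moments accurately

u1s, u2s, observed = [], [], 0
for _ in range(n):
    v1, v2 = random.gauss(0, 1), random.gauss(0, 1)
    e1, e2 = random.gauss(0, 1), random.gauss(0, 1)
    x1 = v1 + 0.5 * v2
    x2 = 0.5 * v1 + v2
    u1 = 0.4358899 * e1 + 0.9 * e2
    u2 = e2
    y2 = 2.0 * x1 + 2.0 * x2 + u2      # latent selection equation
    u1s.append(u1)
    u2s.append(u2)
    if y2 > 0:                          # Y1 is observed only in this case
        observed += 1

mean1 = sum(u1s) / n
mean2 = sum(u2s) / n
var1 = sum((u - mean1) ** 2 for u in u1s) / n
var2 = sum((u - mean2) ** 2 for u in u2s) / n
cov = sum((a - mean1) * (b - mean2) for a, b in zip(u1s, u2s)) / n
rho = cov / math.sqrt(var1 * var2)
print(round(var1, 2), round(var2, 2), round(rho, 2), round(observed / n, 2))
```

Note that 0.4358899² + 0.9² = 0.19 + 0.81 = 1, which is why var(*U*_{1}) = 1 exactly.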

The data file involved is HECKMANDATA.TXT in
EasyReg space delimited text format, containing the variables

- Y = *Y*_{j}
- X1 = *X*_{1,j}
- X2 = *X*_{2,j}

In order to demonstrate the effect of the sample selection bias, regress Y on X1, X2 and the constant 1. EasyReg will automatically skip the (256) observations for which Y is a missing value. Then you will get the following results:

OLS estimation results:

Parameters   Estimate   t-value   H.C. t-value(*)
b(1)         0.88716     9.064     9.088
b(2)         0.89251     9.814    10.802
b(3)         0.35509     3.476     3.810

(*) Based on White's heteroskedasticity consistent variance matrix.
Effective sample size (n) = 244
Standard error of the residuals = 0.970034
R-square = 0.596347
Adjusted R-square = 0.592997
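The same bias can be reproduced outside EasyReg. The following Python sketch (illustrative only, using plain least squares via the normal equations) generates data by the recipe above, keeps only the observations with *Y*_{2} > 0, and regresses Y on X1, X2 and the constant. With a large sample, the slope estimates should settle below their true value 1 and the intercept well above its true value 0:

```python
import random

def solve3(a, rhs):
    """Solve a 3x3 linear system a x = rhs by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

random.seed(2024)
n = 100_000  # much larger than the tour's n = 500, to expose the asymptotic bias
rows, ys = [], []
for _ in range(n):
    v1, v2 = random.gauss(0, 1), random.gauss(0, 1)
    e1, e2 = random.gauss(0, 1), random.gauss(0, 1)
    x1, x2 = v1 + 0.5 * v2, 0.5 * v1 + v2
    u1, u2 = 0.4358899 * e1 + 0.9 * e2, e2
    y1 = x1 + x2 + u1
    y2 = 2.0 * x1 + 2.0 * x2 + u2
    if y2 > 0:                        # keep only the observed Y's, as OLS would
        rows.append((x1, x2, 1.0))
        ys.append(y1)

# Normal equations (X'X) b = X'y for regressors (X1, X2, 1)
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
b = solve3(xtx, xty)
print([round(v, 3) for v in b])  # true values are 1, 1, 0
```

The estimates land in the neighborhood of the tour's OLS results (slopes below 1, a clearly positive intercept), confirming that the bias does not vanish as the sample grows.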

Recall that the true values of the parameters are:

- β_{1} = 1
- β_{2} = 1
- β_{3} = 0

- Wald test of the joint restriction b(1) = 1, b(2) = 1, b(3) = 0, on the basis of the standard variance matrix:

Wald test statistic: 12.60
Asymptotic null distribution: Chi-square(3)
p-value = 0.00557

Significance levels:  10%     5%
Critical values:      6.25    7.81
Conclusions:          reject  reject
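These p-values can be checked by hand: for 3 degrees of freedom the chi-square CDF has the closed form F(x) = erf(√(x/2)) − √(2x/π)·e^{−x/2}, so no statistical tables are needed. The following Python sketch (not part of EasyReg) reproduces the reported p-values for both Wald statistics, as well as the 5% critical value:

```python
import math

def chi2_3_sf(x):
    """Survival function (1 - CDF) of the chi-square distribution with 3 df,
    using the closed form F(x) = erf(sqrt(x/2)) - sqrt(2x/pi)*exp(-x/2)."""
    return 1.0 - (math.erf(math.sqrt(x / 2.0)) -
                  math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

p1 = chi2_3_sf(12.60)   # Wald test with the standard variance matrix
p2 = chi2_3_sf(14.91)   # Wald test with White's H.C. variance matrix
print(round(p1, 5), round(p2, 5))
```

Both values match the output (0.00557 and 0.00190), and chi2_3_sf(7.81) ≈ 0.05 confirms the 5% critical value.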

- Wald test on the basis of White's heteroskedasticity consistent
variance matrix:

Wald test statistic: 14.91
Asymptotic null distribution: Chi-square(3)
p-value = 0.00190

Significance levels:  10%     5%
Critical values:      6.25    7.81
Conclusions:          reject  reject

As said before, when you select the model variables EasyReg automatically skips the observations containing missing values. In order to get around this problem, you have to transform the Y variable first: Open "Menu > Input > Transform variables" in the EasyReg main window.

Click the "x is missing value -> dummy = 0, x = 0" button and then double click the variable Y:

Click "Selection OK". Then the following two new variables will be created: "Missing is zero[Y]" and "Dummy not missing[Y]":

Click "OK":

Click "Done".

The variable "Missing is zero[Y]" will now be used as the dependent variable in the sample selection model instead of Y, and the dummy variable "Dummy not missing[Y]" will serve as the selection indicator.

Do not rename these variables, because EasyReg will automatically select a matching pair of variables of the type "Missing is zero[x]" and "Dummy not missing[x]".

Now open "Menu > Single equation models > Sample selection (cross-section) models" in the EasyReg main window:

The model variables are X1, X2, "Missing is zero[Y]" and "Dummy not missing[Y]":

Selection of a subset of observations usually makes no sense for sample selection models. Therefore, click "No" and then "Continue":

As said before, EasyReg will automatically select a matching pair of variables of the type "Missing is zero[x]" and "Dummy not missing[x]":

Click "Continue":

Next, you have to select the X-variables. EasyReg automatically includes the constant 1 in the list of potential X variables, and preselects it, i.e., the window opens with "* 1". Select the additional X variables X1 and X2, and click "Selection OK". Then the window changes to:

In our case the X and Z variables are the same. Again, EasyReg automatically includes the constant 1 in the list of potential Z variables, and preselects it. Thus, select the additional Z variables X1 and X2, and click "Selection OK". Then the window changes to:

Click "Continue". Then the window changes to:

We are now going to estimate the Probit model *Y*_{2} = γ'*Z* + *U*_{2}, where γ = (γ_{1}, γ_{2}, γ_{3})'.

Note that EasyReg displays the model *Y*_{2} = γ'*Z* + *U*_{2} with γ = (γ_{1}, γ_{2}, γ_{3})'. Recall that the true parameter values are:

- γ_{1} = 2
- γ_{2} = 2
- γ_{3} = 0

Note that EasyReg displays the model *Y*_{1} = β'*X* + *U*_{1} with β = (β_{1}, β_{2}, β_{3})'. Recall that the true parameter values are:

- β_{1} = 1
- β_{2} = 1
- β_{3} = 0
- ρ = 0.9
- σ = 1

Click "Continue":

Click "Start SIMPLEX iteration":

Following the advice of EasyReg, leave "Auto restart" checked, and click "Restart SIMPLEX iteration". Then click "Done with SIMPLEX iteration":

The asymptotic variance matrix of the ML estimates is usually not of interest, but if you want to print it to the output file, uncheck the box involved.

If some of the parameter estimates involve very large or small numbers (in absolute value), check the box "Display the estimation results in floating point format", otherwise leave it unchecked.

When you click "Continue", the t and p values of the ML estimates will be computed, and if everything goes well (i.e., if the estimated Fisher information matrix is nonsingular), the output will be displayed:

Click "Continue". Then module "NEXTMENU" is activated, which in this case will enable you to conduct the Wald test of linear parameter restrictions, and append the output file OUTPUT.TXT with the output shown below:

Recall that the true values of the parameters are:

- b(1) = β_{1} = 1
- b(2) = β_{2} = 1
- b(3) = β_{3} = 0
- c(1) = γ_{1} = 2
- c(2) = γ_{2} = 2
- c(3) = γ_{3} = 0
- r = ρ = 0.9
- s = σ = 1

The initial parameter estimates are:

- b(1) = 1.068566
- b(2) = 1.038234
- b(3) = -0.092985
- c(1) = 2.101653
- c(2) = 2.007804
- c(3) = -0.083671
- r = 0.917299
- s = 1.000785

Since you have seen the Wald test option before in conducting OLS, I will not discuss it here.

Heckman's sample selection model:
Latent variable model 1: Y1 = b'X + U1, where Y1 = Y
Latent variable model 2: Y2 = c'Z + U2, where only the sign of Y2 is
observed and Y1 is only observed if Y2 > 0:
Dummy not missing[Y] = 1 if Y2 > 0, else Dummy not missing[Y] = 0.
The error terms U1 and U2 are jointly normally distributed, and are
independent of X and Z. Moreover, Var(U2) = 1.
Next to the components of b and c there are two additional parameters:
r = the correlation coefficient of U1 and U2
s = the square root of the variance of U1

X variables:
X(1)=X1
X(2)=X2
X(3)=1

Z variables:
Z(1)=X1
Z(2)=X2
Z(3)=1

Chosen sample:
Observations 1 to 500
Effective sample size: 500
Frequency of Dummy not missing[Y] = 1: 48.80

Initial Probit estimates of c in latent variable model 2:
Newton iteration successfully completed after 8 iterations
Last absolute parameter change = 0.0000
Last percentage change of the likelihood = 0.0000

Maximum likelihood estimation results:
Z variables      c(.)          (t-value)  [p-value]
Z(1)=X1          2.101653E+00  (7.52)     [0.00000]
Z(2)=X2          2.007804E+00  (7.42)     [0.00000]
Z(3)=1          -8.367128E-02  (-0.74)    [0.45698]
[The two-sided p-values are based on the normal approximation]
Log likelihood: -8.01301859191E+001
Sample size (n): 500

Initial parameter estimates:
b(1) =  1.068566E+00
b(2) =  1.038234E+00
b(3) = -9.298452E-02
c(1) =  2.101653E+00
c(2) =  2.007804E+00
c(3) = -8.367128E-02
r    =  0.000000E+00
s    =  1.000785E+00

The Log-likelihood has been maximized using the simplex method of Nelder
and Mead. The algorithm involved is a Visual Basic translation of the
Fortran algorithm in: Press, W.H., B.P. Flannery, S.A. Teukolsky and
W.T. Vetterling (1986): 'Numerical Recipes', Cambridge University Press,
pp. 292-293.

Full information maximum likelihood estimation results:
Parameters   ML estimates   t-value  [p-value]
b(1)          1.096625       9.679   [0.00000]
b(2)          1.025825       9.175   [0.00000]
b(3)         -0.112768      -0.621   [0.53480]
c(1)          1.870422       4.812   [0.00000]
c(2)          2.035303       5.583   [0.00000]
c(3)         -0.063557      -0.381   [0.70318]
r             0.945032       0.786   [0.43205]
s             0.998827      16.626   [0.00000]
[The two-sided p-values are based on the normal approximation]
Log-Likelihood = -401.49387556627
n = 500
Information criteria:
Akaike:       1.637975502
Hannan-Quinn: 1.664436388
Schwarz:      1.705409232