## Guided tour on Heckman's sample selection model

Notation: wherever a formula in this guided tour shows a Roman letter in place of a Greek symbol, read "b" as β (beta), "g" as γ (gamma), "r" as ρ (rho), "s" as σ (sigma), "f" as φ (the standard normal density) and "F" as Φ (the standard normal distribution function).

### The model

Heckman's sample selection model is based on the following two latent variable models:

1. $Y_1 = \beta'X + U_1$
2. $Y_2 = \gamma'Z + U_2$

where $X$ is a $k$-vector of regressors, $Z$ is an $m$-vector of regressors, possibly including 1's for the intercepts, and the error terms $U_1$ and $U_2$ are jointly normally distributed, independently of $X$ and $Z$, with zero expectations.

The first model is the model we are interested in. However, the latent variable $Y_1$ is only observed if $Y_2 > 0$. Thus, the actual dependent variable is:

$Y = Y_1$ if $Y_2 > 0$, and $Y$ is a missing value if $Y_2 \le 0$.

The latent variable $Y_2$ itself is not observable; only its sign is. We only know that $Y_2 > 0$ if $Y$ is observable, and $Y_2 \le 0$ if not. Consequently, we may without loss of generality normalize $U_2$ such that its variance is equal to 1.

### Sample selection bias

As has been explained in detail in HECKMAN.PDF, if you ignore the sample selection problem and regress $Y$ on $X$ using the observed $Y$'s only, then the OLS estimator of $\beta$ will be biased, due to the fact that

$$E[Y_1 \mid Y_2 > 0, X, Z] = \beta'X + \rho\sigma\,\frac{\varphi(\gamma'Z)}{\Phi(\gamma'Z)},$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution, $\varphi$ is the corresponding density, $\sigma^2$ is the variance of $U_1$, and $\rho$ is the correlation between $U_1$ and $U_2$. Hence,

$$E[Y_1 \mid Y_2 > 0, X] = \beta'X + \rho\sigma\,E\left[\varphi(\gamma'Z)/\Phi(\gamma'Z) \mid X\right].$$

The latter term causes the sample selection bias if $\rho$ is non-zero.
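To get a feel for the size of the bias term, the ratio of the standard normal density to its distribution function (the inverse Mills ratio) can be evaluated numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not part of EasyReg, and the values of the index γ'Z in the loop are arbitrary.

```python
import math

def std_normal_pdf(x):
    """Standard normal density (phi)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    """Standard normal distribution function (Phi)."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def selection_bias_term(index, rho, sigma):
    """rho * sigma * phi(index) / Phi(index), where index stands for gamma'Z."""
    return rho * sigma * std_normal_pdf(index) / std_normal_cdf(index)

# Illustrative values of gamma'Z (assumed, not taken from the data set).
for index in (-2.0, 0.0, 2.0):
    print(index, selection_bias_term(index, rho=0.9, sigma=1.0))
```

The term vanishes when rho is zero and shrinks as γ'Z grows, i.e., as selection into the observed sample becomes nearly certain.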

### Maximum likelihood estimation

In order to avoid the sample selection problem, and to get asymptotically efficient estimators, one has to estimate the model parameters by maximum likelihood. In order to do so, we need to derive the conditional density $h(y \mid X,Z,\beta,\gamma,\rho,\sigma)$, say, of $Y_1$ given $Y_2 > 0$, $X$ and $Z$:

$$h(y \mid X,Z,\beta,\gamma,\rho,\sigma) = \frac{1}{\sigma\,\Phi(\gamma'Z)}\,\varphi\left(\frac{y-\beta'X}{\sigma}\right)\Phi\left[\frac{\rho(y-\beta'X)/\sigma + \gamma'Z}{\sqrt{1-\rho^2}}\right],$$

where $\Phi$ and $\varphi$ are the same as before. See HECKMAN.PDF. Next, let $Y_j$, $X_j$, $Z_j$, $j = 1,2,\ldots,n$, be the observations on $Y$, $X$, and $Z$, respectively, where some of the $Y_j$'s are missing values. Define the dummy variable $D_j = 0$ if $Y_j$ is a missing value, and $D_j = 1$ if not. Then the log-likelihood function takes the form:

$$\ln L(\beta,\gamma,\rho,\sigma) = \sum_{j=1}^{n} \left\{ D_j \ln\left[h(Y_j \mid X_j,Z_j,\beta,\gamma,\rho,\sigma)\right] + D_j \ln\left[\Phi(\gamma'Z_j)\right] + (1-D_j)\ln\left[1-\Phi(\gamma'Z_j)\right] \right\}.$$

The details of the maximum likelihood estimation procedure can be found in HECKMAN.PDF.
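The log-likelihood above can be written down almost literally in code. The sketch below is a plain Python transcription under stated assumptions: missing values of Y are marked by a dummy of 0 (the stored Y value is then ignored), regressors are given as lists of rows, and the helper names are hypothetical. It is not EasyReg's implementation, and it does not guard against |rho| = 1.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))

def dot(coef, row):
    """Inner product of a coefficient vector and a regressor row."""
    return sum(c * v for c, v in zip(coef, row))

def log_likelihood(beta, gamma, rho, sigma, y, d, x_rows, z_rows):
    """Sample selection log-likelihood; d[j] = 1 if y[j] is observed, else 0."""
    total = 0.0
    for yj, dj, xj, zj in zip(y, d, x_rows, z_rows):
        zg = dot(gamma, zj)
        if dj == 1:
            e = (yj - dot(beta, xj)) / sigma
            # Conditional density h of Y1 given Y2 > 0, X and Z.
            h = (phi(e) / (sigma * Phi(zg))
                 * Phi((rho * e + zg) / math.sqrt(1.0 - rho * rho)))
            total += math.log(h) + math.log(Phi(zg))
        else:
            # Phi(-zg) equals 1 - Phi(zg) by symmetry.
            total += math.log(Phi(-zg))
    return total
```

A function like this could be handed to any numerical optimizer; with rho = 0 the contribution of an observed Y reduces to the ordinary normal regression density plus the Probit term.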

## Sample selection models in EasyReg

### The data

The data has been generated artificially as follows. First, the independent variables $X_{1,j}$ and $X_{2,j}$ for $j = 1,\ldots,n = 500$ have been generated as:

1. $X_{1,j} = v_{1,j} + 0.5\,v_{2,j}$
2. $X_{2,j} = 0.5\,v_{1,j} + v_{2,j}$

where $v_{1,j}$ and $v_{2,j}$ are independent random drawings from the $N(0,1)$ distribution. Next, the latent dependent variables $Y_{1,j}$ and $Y_{2,j}$ for $j = 1,\ldots,n = 500$ have been generated as:

1. $Y_{1,j} = X_{1,j} + X_{2,j} + U_{1,j}$
2. $Y_{2,j} = 2X_{1,j} + 2X_{2,j} + U_{2,j}$

where

1. $U_{1,j} = 0.4358899\,e_{1,j} + 0.9\,e_{2,j}$
2. $U_{2,j} = e_{2,j}$

with $e_{1,j}$ and $e_{2,j}$ independent random drawings from the $N(0,1)$ distribution. The coefficients involved have been chosen such that

• $\rho = \mathrm{corr}(U_{1,j},U_{2,j}) = 0.9$
• $\sigma^2 = \mathrm{var}(U_{1,j}) = \mathrm{var}(U_{2,j}) = 1$.

Finally, the observed dependent variables $Y_j$ are $Y_j = Y_{1,j}$ if $Y_{2,j} > 0$, and $Y_j$ is a missing value otherwise.
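The data generating process described above can be imitated with a few lines of Python. This is only a sketch: the seed and the function name are arbitrary assumptions, so the draws will not reproduce HECKMANDATA.TXT, and the second latent equation is taken to be Y2 = 2·X1 + 2·X2 + U2, in line with the true values 2, 2 and 0 quoted later for the selection equation.

```python
import random

def generate_sample(n=500, seed=12345):
    """Simulate one artificial data set; None marks a missing Y."""
    rng = random.Random(seed)  # the seed is an arbitrary choice
    rows = []
    for _ in range(n):
        v1, v2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = v1 + 0.5 * v2
        x2 = 0.5 * v1 + v2
        u1 = 0.4358899 * e1 + 0.9 * e2
        u2 = e2
        y1 = x1 + x2 + u1              # latent outcome
        y2 = 2.0 * x1 + 2.0 * x2 + u2  # latent selection variable
        y = y1 if y2 > 0 else None     # Y is observed only if Y2 > 0
        rows.append({"Y": y, "X1": x1, "X2": x2, "Y2": y2})
    return rows

sample = generate_sample()
observed = sum(1 for row in sample if row["Y"] is not None)
print("observed:", observed, "missing:", len(sample) - observed)

# var(U1) implied by the two coefficients; should be (very nearly) 1.
print(0.4358899 ** 2 + 0.9 ** 2)
```

By symmetry about half of the observations should end up missing, which is in line with the roughly 49% of observed Y's in the actual data file.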

The data file involved is HECKMANDATA.TXT in EasyReg space-delimited text format, containing the variables Y = $Y_j$, X1 = $X_{1,j}$ and X2 = $X_{2,j}$. Since the sample selection module in EasyReg only works for cross-section data, you should declare the data accordingly.

### Sample selection bias

In order to demonstrate the effect of the sample selection bias, regress Y on X1, X2 and the constant 1. EasyReg will automatically skip the 256 observations for which Y is a missing value. Then you will get the following results:

```
OLS estimation results
Parameters  Estimate    t-value  H.C. t-value(*)
b(1)         0.88716      9.064            9.088
b(2)         0.89251      9.814           10.802
b(3)         0.35509      3.476            3.810

(*) Based on White's heteroskedasticity consistent variance matrix.
Effective sample size (n) = 244
Standard error of the residuals = 0.970034
R-square = 0.596347
```

Recall that the true values of the parameters are:

• $\beta_1 = 1$
• $\beta_2 = 1$
• $\beta_3 = 0$

Due to the sample selection bias, the joint null hypothesis that these are the true parameter values is rejected by the Wald test:
• Wald test on the basis of the standard variance matrix:
```
Wald test statistic:                   12.60
Asymptotic null distribution:  Chi-square(3)
p-value = 0.00557
Significance levels:        10%         5%
Critical values:           6.25       7.81
Conclusions:             reject     reject
```
• Wald test on the basis of White's heteroskedasticity consistent variance matrix:
```
Wald test statistic:                   14.91
Asymptotic null distribution:  Chi-square(3)
p-value = 0.00190
Significance levels:        10%         5%
Critical values:           6.25       7.81
Conclusions:             reject     reject
```
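The p-values and critical values in the two Wald test listings can be checked by hand, because the chi-square distribution with 3 degrees of freedom has a closed-form upper tail probability. The sketch below uses that closed form with the Python standard library; the function name is hypothetical. Since the test statistics are only reported to two decimals, agreement with the listed p-values can only be expected up to rounding (about four decimal places).

```python
import math

def chi2_sf_3df(x):
    """Upper tail probability of the chi-square distribution with 3 degrees of freedom."""
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))

# Wald statistics from the two listings above.
for statistic in (12.60, 14.91):
    print("statistic", statistic, "p-value", chi2_sf_3df(statistic))

# Tail probabilities at the listed 10% and 5% critical values.
for critical_value in (6.25, 7.81):
    print("critical value", critical_value, "tail probability", chi2_sf_3df(critical_value))
```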

### Data preparation

As said before, when you select the model variables EasyReg automatically skips the observations containing missing values. In order to get around this problem, you have to transform the Y variable first: Open "Menu > Input > Transform variables" in the EasyReg main window.

Click the "x is missing value -> dummy = 0, x = 0" button and then double click the variable Y.

Click "Selection OK". Two new variables will then be created: "Missing is zero[Y]" and "Dummy not missing[Y]".

Click "OK", and then click "Done".

The variable "Missing is zero[Y]" will now be used as the dependent variable in the sample selection model instead of Y, and the dummy variable "Dummy not missing[Y]" will be used to identify the zero values of "Missing is zero[Y]" as missing values.

Do not rename these variables, because EasyReg will automatically select a matching pair of variables of the type "Missing is zero[Y]" and "Dummy not missing[Y]" as the dependent variables in the sample selection model.
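The effect of this transformation is easy to state in code. The following sketch mimics it for a Python list in which `None` marks a missing value; the function name and the example values are illustrative assumptions, not EasyReg internals.

```python
def split_missing(values):
    """Return (missing_is_zero, dummy_not_missing) for a list with None as missing."""
    missing_is_zero = [0.0 if v is None else v for v in values]
    dummy_not_missing = [0 if v is None else 1 for v in values]
    return missing_is_zero, dummy_not_missing

y = [1.5, None, -0.3, None]   # made-up example values
y_filled, d = split_missing(y)
print(y_filled)
print(d)
```

The dummy is what distinguishes a zero that stands for a missing value from a genuine observation that happens to equal zero.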

### How to estimate a sample selection model with EasyReg

Now open "Menu > Single equation models > Sample selection (cross-section) models" in the EasyReg main window.

The model variables are X1, X2, Missing is zero[Y] and Dummy not missing[Y]. Double click them and click "Selection OK".

Selection of a subset of observations usually makes no sense for sample selection models. Therefore, click "No" and then "Continue".

As said before, EasyReg will automatically select a matching pair of variables of the type "Missing is zero[Y]" and "Dummy not missing[Y]" as the dependent variables in the sample selection model. If there are no, or multiple, matching pairs of these variables, you cannot continue.

Click "Continue".

Next, you have to select the X-variables. EasyReg automatically includes the constant 1 in the list of potential X variables, and preselects it, i.e., the window opens with "* 1". Select the additional X variables X1 and X2, and click "Selection OK".

In our case the X and Z variables are the same. Again, EasyReg automatically includes the constant 1 in the list of potential Z variables, and preselects it. Thus, select the additional Z variables X1 and X2, and click "Selection OK".

Click "Continue".

We are now going to estimate the Probit model for "Dummy not missing[Y]", in order to derive an initial estimate of the parameter vector $\gamma = (\gamma_1,\gamma_2,\gamma_3)'$ in the latent variable model $Y_2 = \gamma'Z + U_2$. In general there is no need to adjust the stopping rules for the Newton iteration, which is used to conduct the Probit estimation. Thus, click "Continue". Then in a few seconds the Probit estimation results appear.

Note that EasyReg displays model $Y_2 = \gamma'Z + U_2$ as Y2 = c'Z + U2, where c = (c(1),c(2),c(3))' = $\gamma = (\gamma_1,\gamma_2,\gamma_3)'$. Moreover, recall that the true values of these parameters are:

• $\gamma_1 = 2$
• $\gamma_2 = 2$
• $\gamma_3 = 0$

Click "Continue".

Note that EasyReg displays model $Y_1 = \beta'X + U_1$ as Y1 = b'X + U1, where b = (b(1),b(2),b(3))' = $\beta = (\beta_1,\beta_2,\beta_3)'$. Moreover, the correlation $\rho$ of $U_1$ and $U_2$ is displayed as "r" and the standard error $\sigma$ of $U_1$ as "s". Furthermore, recall that the true values of these parameters are:

• $\beta_1 = 1$
• $\beta_2 = 1$
• $\beta_3 = 0$
• $\rho = 0.9$
• $\sigma = 1$

Given the Probit results, these parameters have been estimated by OLS. The initial estimates involved will now be used as starting values for full information maximum likelihood. See HECKMAN.PDF.

Click "Continue".

Click "Start SIMPLEX iteration".

Following the advice of EasyReg, leave "Auto restart" checked, and click "Restart SIMPLEX iteration". Then click "Done with SIMPLEX iteration".

The asymptotic variance matrix of the ML estimates is usually not of interest, but if you want to print it to the output file, uncheck the box involved.

If some of the parameter estimates involve very large or small numbers (in absolute value), check the box "Display the estimation results in floating point format", otherwise leave it unchecked.

When you click "Continue", the t and p values of the ML estimates will be computed, and if everything goes well (i.e., if the estimated Fisher information matrix is nonsingular), the output will be displayed.

Click "Continue". Then module "NEXTMENU" is activated, which in this case will enable you to conduct the Wald test of linear parameter restrictions, and to append the output listed in the section "The output" to the output file OUTPUT.TXT.

Recall that the true values of the parameters are:

• b(1) = $\beta_1$ = 1
• b(2) = $\beta_2$ = 1
• b(3) = $\beta_3$ = 0
• c(1) = $\gamma_1$ = 2
• c(2) = $\gamma_2$ = 2
• c(3) = $\gamma_3$ = 0
• r = $\rho$ = 0.9
• s = $\sigma$ = 1

The estimated parameters are pretty close to the true values (and not significantly different from the true values at the 5% significance level), but most of the initial estimates are actually closer to the true values than the ML estimates:

• b(1) = 1.068566
• b(2) = 1.038234
• b(3) = -0.092985
• c(1) = 2.101653
• c(2) = 2.007804
• c(3) = -0.083671
• r = 0.917299
• s = 1.000785

Only b(2) and c(3) are closer to their true values after ML estimation. However, this seems coincidental.
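The statement that the ML estimates do not differ significantly from the true values at the 5% level can be checked parameter by parameter from the full information ML results in the output listing. The sketch below assumes that each reported t-value is the estimate divided by its standard error, so that the standard error can be recovered as estimate / t-value; it then forms the z-statistic for the null hypothesis that the parameter equals its true value. The function name is hypothetical, and these are individual tests, not a joint Wald test.

```python
# (ML estimate, reported t-value, true value), copied from the output listing.
results = {
    "b(1)": (1.096625, 9.679, 1.0),
    "b(2)": (1.025825, 9.175, 1.0),
    "b(3)": (-0.112768, -0.621, 0.0),
    "c(1)": (1.870422, 4.812, 2.0),
    "c(2)": (2.035303, 5.583, 2.0),
    "c(3)": (-0.063557, -0.381, 0.0),
    "r": (0.945032, 0.786, 0.9),
    "s": (0.998827, 16.626, 1.0),
}

def z_against_truth(estimate, t_value, true_value):
    """z-statistic for H0: parameter = true_value, assuming t = estimate / s.e."""
    standard_error = estimate / t_value
    return (estimate - true_value) / standard_error

z_values = {name: z_against_truth(*row) for name, row in results.items()}
for name, z in z_values.items():
    verdict = "reject" if abs(z) > 1.96 else "do not reject"
    print(name, round(z, 3), verdict)
```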

Since you have seen the Wald test option before in conducting OLS, I will not discuss it here.

### The output

```
Heckman's sample selection model:

Latent variable model 1:
Y1 = b'X + U1,
where Y1 = Y
Latent variable model 2:
Y2 = c'Z + U2,
where only the sign of Y2 is observed and Y1 is only observed if Y2 > 0:
Dummy not missing[Y] = 1 if Y2 > 0, else
Dummy not missing[Y] = 0.
The error terms U1 and U2 are jointly normally distributed, and are
independent of X and Z. Moreover, Var(U2) = 1.
Next to the components of b and c there are two additional parameters:
r = the correlation coefficient of U1 and U2
s = the square root of the variance of U1

X variables:
X(1)=X1
X(2)=X2
X(3)=1

Z variables:
Z(1)=X1
Z(2)=X2
Z(3)=1

Chosen sample: Observations 1 to 500
Effective sample size: 500
Frequency of Dummy not missing[Y] = 1: 48.80

Initial Probit estimates of c in latent variable model 2:

Newton iteration succesfully completed after 8 iterations
Last absolute parameter change = 0.0000
Last percentage change of the likelihood = 0.0000

Maximum likelihood estimation results:
Z variables          c(.)  (t-value) [p-value]
Z(1)=X1      2.101653E+00     (7.52) [0.00000]
Z(2)=X2      2.007804E+00     (7.42) [0.00000]
Z(3)=1      -8.367128E-02    (-0.74) [0.45698]
[The two-sided p-values are based on the normal approximation]
Log likelihood: -8.01301859191E+001
Sample size (n): 500

Initial parameter estimates:
b(1) =  1.068566E+00
b(2) =  1.038234E+00
b(3) = -9.298452E-02
c(1) =  2.101653E+00
c(2) =  2.007804E+00
c(3) = -8.367128E-02
r    =  0.000000E+00
s    =  1.000785E+00

The Log-likelihood has been maximized using the simplex method of Nelder
and Mead. The algorithm involved is a Visual Basic translation of the
Fortran algorithm involved in:
Press, W.H., B.P.Flannery, S.A.Teukolsky and W.T.Vetterling (1986):
'Numerical Recipes', Cambridge University Press, pp. 292-293

Full information maximum likelihood estimation results:
Parameters ML estimates t-value [p-value]
b(1)           1.096625   9.679 [0.00000]
b(2)           1.025825   9.175 [0.00000]
b(3)          -0.112768  -0.621 [0.53480]
c(1)           1.870422   4.812 [0.00000]
c(2)           2.035303   5.583 [0.00000]
c(3)          -0.063557  -0.381 [0.70318]
r              0.945032   0.786 [0.43205]
s              0.998827  16.626 [0.00000]
[The two-sided p-values are based on the normal approximation]
Log-Likelihood = -401.49387556627
n = 500
Information criteria:
Akaike:               1.637975502
Hannan-Quinn:         1.664436388
Schwarz:              1.705409232
```
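The three information criteria at the end of the listing can be recomputed from the reported log-likelihood, the sample size n = 500 and the number of estimated parameters (8: three in b, three in c, plus r and s). The formulas below, each divided by n, are an inference from the reported numbers rather than taken from the EasyReg documentation; they do reproduce the listed values.

```python
import math

log_lik = -401.49387556627  # Log-Likelihood from the listing
n = 500                     # sample size
k = 8                       # b(1..3), c(1..3), r and s

akaike = (-2.0 * log_lik + 2.0 * k) / n
hannan_quinn = (-2.0 * log_lik + 2.0 * k * math.log(math.log(n))) / n
schwarz = (-2.0 * log_lik + k * math.log(n)) / n

print("Akaike:      ", akaike)
print("Hannan-Quinn:", hannan_quinn)
print("Schwarz:     ", schwarz)
```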