Guided tour on two-stage least squares estimation

The model

A typical example of a model for which Two-Stage Least Squares (TSLS) is applicable is the first equation of the following system of equations:

$$y_{1,t} = \alpha' X_{1,t} + \beta' Y_{2,t} + u_{1,t},$$

$$Y_{2,t} = \Gamma_1 X_{1,t} + \Gamma_2 X_{2,t} + U_{2,t},$$

$$t = 1, 2, 3, \ldots, n,$$

where $y_{1,t}$ is a scalar dependent variable, $Y_{2,t}$ is a vector of other dependent variables, $X_{1,t}$ is a vector of common exogenous variables, possibly including 1 for the constant term, and $X_{2,t}$ is a vector of additional explanatory variables for $Y_{2,t}$.

The equation for $y_{1,t}$ is the first equation of a classical simultaneous equations system

$$B Y_t = \Gamma X_t + U_t,$$

where $Y_t = (y_{1,t}, Y_{2,t}')'$ and $X_t = (X_{1,t}', X_{2,t}')'$. The equation for $Y_{2,t}$ is just the corresponding part of the reduced form equation

$$Y_t = B^{-1}\Gamma X_t + B^{-1}U_t.$$

The equation for $y_{1,t}$ is the one we are interested in.

If the error term $u_{1,t}$ is correlated with the components of the error vector $U_{2,t}$, then $E[u_{1,t} Y_{2,t}'] \neq 0'$, which violates one of the basic assumptions of the classical linear regression model. Consequently, if you estimate the parameter vectors $\alpha$ and $\beta$ by OLS, the resulting estimates will be inconsistent.
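The first claim can be checked in one line. The sketch below assumes, as is standard for exogenous variables but not stated explicitly in the text, that the exogenous vectors are uncorrelated with the error of the first equation, so that the first two expectations on the right-hand side vanish:

```latex
\begin{aligned}
E[u_{1,t} Y_{2,t}']
  &= E[u_{1,t} X_{1,t}']\,\Gamma_1' + E[u_{1,t} X_{2,t}']\,\Gamma_2' + E[u_{1,t} U_{2,t}'] \\
  &= E[u_{1,t} U_{2,t}'] \neq 0'.
\end{aligned}
```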

Instrumental variables estimation

The model $y_{1,t} = \alpha' X_{1,t} + \beta' Y_{2,t} + u_{1,t}$ can be written compactly as

$$y = X\theta + u,$$

where $y$ is the vector of stacked variables $y_{1,t}$ for $t = 1, 2, \ldots, n$, $X$ is the matrix with rows $(X_{1,t}', Y_{2,t}')$ for $t = 1, 2, \ldots, n$, $u$ is the vector of stacked errors $u_{1,t}$ for $t = 1, 2, \ldots, n$, and $\theta = (\alpha', \beta')'$. Moreover, let $Z$ be the matrix with rows $(X_{1,t}', X_{2,t}')$ for $t = 1, 2, \ldots, n$.

As motivated in the previous section, the error vector u satisfies

$$E[X'u] \neq 0, \quad \text{but} \quad E[Z'u] = 0.$$

Due to the latter, and under some further regularity conditions, the parameter vector $\theta$ can be estimated consistently by the Instrumental Variables (IV) approach, using $Z$ as the matrix of instrumental variables, and the resulting estimator is asymptotically normal. The IV approach is a special case of the Method of Moments (MM) approach. As explained in my lecture notes on the Method of Moments (and in most intermediate econometrics textbooks as well), the IV estimator $\theta_n$ of $\theta$ takes the form

$$\theta_n = (X'P_Z X)^{-1} X'P_Z y,$$

where

$$P_Z = Z(Z'Z)^{-1}Z'.$$

Of course, we have to require that the matrix $X'P_Z X$ is nonsingular. A necessary (but not sufficient!) condition for this is that:

The number of variables in $X_{2,t}$ is greater than or equal to the number of variables in $Y_{2,t}$.
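The estimator and the order condition above can be sketched in a few lines of NumPy. This is a minimal illustration, not EasyReg's implementation: the function name `iv_estimate`, the simulated data, and the coefficient values (1.0 and 0.5) are all assumptions made for the example.

```python
import numpy as np

def iv_estimate(y, X, Z):
    """Return (X'P_Z X)^{-1} X'P_Z y, where P_Z = Z (Z'Z)^{-1} Z'."""
    k = X.shape[1]   # number of regressors (columns for X1 and Y2)
    m = Z.shape[1]   # number of instruments (columns for X1 and X2)
    # Order condition: necessary, but not sufficient, for nonsingularity.
    if m < k:
        raise ValueError("order condition fails: fewer instruments than regressors")
    # P_Z X, computed without forming the n-by-n matrix P_Z.
    PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
    A = X.T @ PZX    # X'P_Z X
    if np.linalg.matrix_rank(A) < k:
        raise ValueError("X'P_Z X is singular")
    # P_Z is symmetric, so X'P_Z y = (P_Z X)'y.
    return np.linalg.solve(A, PZX.T @ y)

# Hypothetical example: a constant, one endogenous regressor, one extra instrument.
rng = np.random.default_rng(0)
n = 5000
x1 = np.ones(n)                  # constant term (plays the role of X1)
x2 = rng.normal(size=n)          # additional exogenous variable (X2)
u1 = rng.normal(size=n)
u2 = rng.normal(size=n)
y2 = 0.5 + x2 + u1 + u2          # endogenous: shares u1 with y1
y1 = 1.0 + 0.5 * y2 + u1
X = np.column_stack([x1, y2])
Z = np.column_stack([x1, x2])
theta_n = iv_estimate(y1, X, Z)
print(theta_n)                   # compare with the assumed true values (1.0, 0.5)
```

Dropping the column `x2` from `Z` leaves one instrument for two regressors, so the function raises an error at the order-condition check.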

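Under regularity conditions, the IV estimator $\theta_n$ is asymptotically normally distributed:

$$\sqrt{n}\,(\theta_n - \theta) \to N[0, \sigma^2\Omega]$$

in distribution, where $\sigma^2$ is the variance of $u_{1,t}$ and

$$\Omega = \operatorname*{plim}_{n\to\infty}\, n\,(X'P_Z X)^{-1}.$$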
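In practice this result is used by replacing the error variance with an estimate based on the IV residuals, which gives approximate standard errors. The sketch below is one common way to do this; the function name, the divisor n in the variance estimate, and the simulated data are assumptions for illustration, and EasyReg's output may use a different degrees-of-freedom correction.

```python
import numpy as np

def iv_with_standard_errors(y, X, Z):
    """IV estimate plus standard errors from s^2 (X'P_Z X)^{-1}."""
    n = X.shape[0]
    PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)   # P_Z X
    A = X.T @ PZX                                 # X'P_Z X
    theta_n = np.linalg.solve(A, PZX.T @ y)
    # Residuals use the original X, not P_Z X.
    resid = y - X @ theta_n
    s2 = resid @ resid / n           # estimate of the error variance (divisor n assumed)
    cov = s2 * np.linalg.inv(A)      # estimated variance matrix of theta_n
    return theta_n, np.sqrt(np.diag(cov))

# Hypothetical data with assumed true parameter values (1.0, 0.5).
rng = np.random.default_rng(3)
n = 4000
x2 = rng.normal(size=n)
u1 = rng.normal(size=n)
y2 = x2 + u1 + rng.normal(size=n)
y1 = 1.0 + 0.5 * y2 + u1
X = np.column_stack([np.ones(n), y2])
Z = np.column_stack([np.ones(n), x2])
theta_n, se = iv_with_standard_errors(y1, X, Z)
print(theta_n, se)
```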

Two-stage least squares

The IV estimator $\theta_n$ is also called the TSLS estimator because it can be derived alternatively in the following two steps.

(1) Project the columns of the matrix $X$ linearly on the space spanned by the columns of the matrix $Z$. The resulting linear projection is the matrix $P_Z X$. Note that linear projection is just regression: regress column $X_i$ of $X$ on $Z$, i.e., estimate the linear regression model $X_i = Z\delta_i + v$ by OLS. Then the linear projection of $X_i$ on $Z$ is $Z\hat\delta_i$, where $\hat\delta_i = (Z'Z)^{-1}Z'X_i$ is the OLS estimator of $\delta_i$. The matrix $P_Z X$ is then the matrix with columns $Z\hat\delta_i$.

(2) Regress $y$ on $P_Z X$. Then the OLS estimator of the parameter vector involved is just the IV estimator $\theta_n$.
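The two steps can be checked numerically. The sketch below uses hypothetical simulated data (the coefficients and sample size are arbitrary) and compares the two-step result with the direct IV formula. The two agree because the projection matrix is symmetric and idempotent, so the cross-product matrix of the fitted values equals the matrix inverted in the IV formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Hypothetical data: a constant, two extra instruments, one endogenous regressor.
x1 = np.ones(n)
x2 = rng.normal(size=(n, 2))
u1 = rng.normal(size=n)
y2 = x2 @ np.array([1.0, -1.0]) + u1 + rng.normal(size=n)
y = 1.0 + 0.5 * y2 + u1
X = np.column_stack([x1, y2])
Z = np.column_stack([x1, x2])

# Step (1): regress every column of X on Z; the fitted values are P_Z X.
D, *_ = np.linalg.lstsq(Z, X, rcond=None)   # column i of D is the OLS estimate for X_i
X_hat = Z @ D

# Step (2): regress y on the fitted values.
theta_two_step, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

# Direct IV formula (X'P_Z X)^{-1} X'P_Z y for comparison.
theta_iv = np.linalg.solve(X.T @ X_hat, X_hat.T @ y)

print(theta_two_step)
print(theta_iv)
```

The two printed vectors should agree up to rounding error.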

Two-stage least squares estimation with EasyReg International

The data have been generated artificially, as

Y1 = Y2 + X1 + X2 + U1

Y2 = X1 + X2 + X3 + X4 + U1 + U2

where X1, X2, X3, and X4 have been drawn independently from the N(0,2) distribution, and U1 and U2 have been drawn independently from the N(0,1) distribution. In this way, 500 observations on Y1, Y2, X1, X2, X3, and X4 have been generated. The data are available as the file TSLSDATA.CSV in Excel CSV format (US number setting).
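A comparable data set can be simulated as follows. This sketch does not reproduce TSLSDATA.CSV: the seed is arbitrary, N(0,2) is assumed to mean variance 2 (standard deviation √2), and no constant term is included in the regressions. It also computes the OLS and TSLS estimates of the equation for Y1, using X1, X2, X3, and X4 as instruments, so that the two can be compared with the true coefficients (all equal to 1).

```python
import numpy as np

rng = np.random.default_rng(2024)   # arbitrary seed, not the one behind TSLSDATA.CSV
n = 500
sd_x = np.sqrt(2.0)                 # assumption: N(0,2) denotes variance 2
X1, X2, X3, X4 = (rng.normal(0.0, sd_x, n) for _ in range(4))
U1 = rng.normal(0.0, 1.0, n)
U2 = rng.normal(0.0, 1.0, n)
Y2 = X1 + X2 + X3 + X4 + U1 + U2
Y1 = Y2 + X1 + X2 + U1

X = np.column_stack([Y2, X1, X2])        # regressors of the Y1 equation
Z = np.column_stack([X1, X2, X3, X4])    # instruments

# OLS ignores the correlation between Y2 and U1.
ols = np.linalg.lstsq(X, Y1, rcond=None)[0]

# TSLS: (X'P_Z X)^{-1} X'P_Z y, with P_Z X computed by regression on Z.
PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
tsls = np.linalg.solve(X.T @ PZX, PZX.T @ Y1)

print("OLS: ", ols)
print("TSLS:", tsls)
```

Under these assumptions the OLS coefficient of Y2 tends to about 7/6 rather than 1, because Y2 contains U1, whereas the TSLS estimates should be close to 1.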

The procedure for selecting the variables in the TSLS model is similar to that for OLS, except that the instrumental variables now also have to be selected as X variables:

TSLS Window 1

Next, you have to indicate which explanatory variables are endogenous variables. In this case the only endogenous X variable is Y2:

TSLS Window 2

Now you have to remove at least as many exogenous variables from the list as there are endogenous X variables. The variables to be removed are X3 and X4, because they do not appear in the equation for Y1 and serve only as instruments:

TSLS Window 3

Once you click "Exogenous variables OK" the window changes to:

TSLS Window 4

Click "Continue". Then the output appears:

TSLS Window 5

Recall that the actual data generating process is Y1 = Y2 + X1 + X2 + U1, so the true coefficients of Y2, X1, and X2 are all equal to 1. The TSLS parameter estimates are pretty close to these true values.

Click "Continue". Then the NEXTMENU window appears, which provides further options. These options have already been discussed in the guided tour on OLS estimation, and will therefore not be discussed again.

TSLS Window 6

This is the end of the guided tour on two-stage least squares estimation.