In this guided tour I will explain in detail how to conduct OLS estimation, and in particular how to interpret the estimation and test results.
I have received many emails from EasyReg users with limited or even no econometric knowledge, asking me elementary econometric questions for which the answers can be found in any decent undergraduate econometrics textbook. This guided tour is intended to answer all the elementary econometric questions you may have. If you still don't understand linear regression, please read a good undergraduate econometrics textbook first rather than asking me! I recommend the following reading list.
The classical linear regression model takes the form
y_{j} = β_{1}x_{1,j} + ... + β_{k}x_{k,j} + u_{j}, j = 1,...,n,
where the x_{i,j}'s are explanatory variables, the β_{i}'s are unknown parameters, and the u_{j}'s are unobserved errors. If the model contains an intercept, the last explanatory variable is identically equal to 1 (x_{k,j} = 1), so that the model reads
y_{j} = β_{1}x_{1,j} + ... + β_{k-1}x_{k-1,j} + β_{k} + u_{j}.
The parameter β_{k} is called the intercept, and the parameters β_{1},...,β_{k-1} are called the slope parameters.
In general you should always include an intercept in your model, unless economic theory prescribes otherwise (which is rare). The reason is that
β_{k} = E(y_{j}) - β_{1}E(x_{1,j}) - ... - β_{k-1}E(x_{k-1,j}),
so that β_{k} picks up the effect of the unconditional expectations of the model variables. There is usually no economic reason to believe that β_{k} = 0. Therefore, in the discussion below I will focus on the model with an intercept, i.e., I will assume that x_{k,j} = 1 for all j.
The model parameters β_{i} are estimated by the OLS estimators b_{1},...,b_{k}, which minimize the sum of squared residuals:
RSS = min_{b_{1},...,b_{k}} Σ_{j}(y_{j} - b_{1}x_{1,j} - ... - b_{k}x_{k,j})^{2},
where RSS stands for Residual Sum of Squares (also called SSR = Sum of Squared Residuals). Moreover, the error variance σ^{2} is estimated by
s^{2} = RSS / (n - k).
The square root of the estimated error variance, s, is called the standard error of the residuals.
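The estimation steps above can be sketched in a few lines of Python (a hypothetical illustration, not EasyReg code; numpy's least-squares solver performs the RSS minimization):

```python
import numpy as np

def ols(y, X):
    """OLS: minimize RSS = sum((y - X b)^2) over b.

    Returns the coefficient vector b, the residual sum of squares RSS,
    and the estimated error variance s2 = RSS / (n - k)."""
    n, k = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min_b ||y - X b||^2
    resid = y - X @ b
    rss = float(resid @ resid)
    s2 = rss / (n - k)                          # error variance estimate
    return b, rss, s2

# Toy data: y = 2*x + 1 + noise, with the intercept as the last regressor.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=50)
X = np.column_stack([x, np.ones(50)])
b, rss, s2 = ols(y, X)
```

With a noise standard deviation of 0.1, the estimated coefficients should be close to 2 and 1, and s² close to 0.01.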
Under some regularity conditions, the t-value of each OLS estimate b_{i} (the estimate divided by its standard error) is approximately standard normally distributed under the null hypothesis that β_{i} = 0. The p-value can also be used to test this null hypothesis: reject β_{i} = 0 at significance level α if the p-value is smaller than α.
The R^{2} compares the RSS of the regression model under review with the RSS of the "model"
y_{j} = α + u_{j}.
The latter RSS is called the Total Sum of Squares (TSS). The OLS estimate a of α is the sample mean of the y_{j}'s:
a = (1/n)Σ_{j}y_{j}.
Thus,
TSS = Σ_{j}(y_{j} - a)^{2}.
The R^{2} is now defined as
R^{2} = 1 - RSS / TSS.
Note that the R^{2} can only be interpreted as a measure of the contribution of the explanatory variables to the explanation of y_{j} if the regression model contains an intercept, as otherwise one would compare apples and oranges. Nevertheless, EasyReg also computes the R^{2} if the model does not contain an intercept, because otherwise I would get too many emails from EasyReg users asking where the R^{2} is.
The larger the R^{2}, the better the model fits the data. However, the R^{2} can be inflated towards its maximum value 1 by adding more explanatory variables to the model. The extreme case is where the number of parameters (including the intercept) is equal to n, so that RSS = 0 and thus R^{2} =1. The
Adjusted R^{2} = 1 - [RSS / (n-k)] / [TSS / (n-1)]
corrects the RSS and the TSS for their degrees of freedom, in order to penalize for the inflationary effect of the number of parameters. The correction is based on the fact that if the model y_{j} = α + u_{j} is correct, then RSS/(n-k) and TSS/(n-1) are both unbiased estimators of the error variance σ^{2}.
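As a sketch (on a simulated toy dataset; this is illustrative code, not EasyReg's implementation), the R² and adjusted R² can be computed from the residual and total sums of squares:

```python
import numpy as np

# Fit y on one regressor plus an intercept.
rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = 1.0 + 0.5 * x + rng.normal(size=40)
X = np.column_stack([x, np.ones(40)])
n, k = X.shape

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
rss = float(resid @ resid)
tss = float(((y - y.mean()) ** 2).sum())    # RSS of the "model" y_j = a + u_j

r2 = 1.0 - rss / tss
adj_r2 = 1.0 - (rss / (n - k)) / (tss / (n - 1))
```

Since k > 1, the adjusted R² is always smaller than the R² itself.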
Under the null hypothesis that none of the explanatory variables have any effect on y_{j}, the overall F statistic
F = [(TSS - RSS)/(k-1)] / [RSS/(n-k)]
in the regression model with an intercept has an F_{k-1,n-k} distribution with k-1 and n-k degrees of freedom. The statistic F is the test statistic of the "overall" F test of the null hypothesis that none of the explanatory variables matter. This test is a right-sided test: the null hypothesis is rejected if the value of the test statistic is larger than the critical value. Rejection of this hypothesis indicates that at least one of the explanatory variables x_{i,j} has a non-zero slope parameter β_{i}. Thus, rejection is good! Note that if the model does not contain an intercept then this F test is not valid, hence EasyReg will not report it.
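A sketch of the overall F test on simulated data (`scipy` is assumed available; the formula matches the F(3,74) = 50.91 statistic in the output below, which you can verify from the reported TSS and RSS):

```python
import numpy as np
from scipy.stats import f as f_dist

# Model with 2 slopes and an intercept; only the first slope is non-zero.
rng = np.random.default_rng(2)
n, k = 60, 3
X = np.column_stack([rng.normal(size=(n, 2)), np.ones(n)])
y = 1.0 + 0.8 * X[:, 0] + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
rss = float(resid @ resid)
tss = float(((y - y.mean()) ** 2).sum())

F = ((tss - rss) / (k - 1)) / (rss / (n - k))
p_value = f_dist.sf(F, k - 1, n - k)          # right-sided test
```

Because one slope is genuinely non-zero here, the test should reject the null hypothesis.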
EasyReg also reports two other tests: the Jarque-Bera/Salmon-Kiefer test of the null hypothesis that the errors are normally distributed, and the Breusch-Pagan test of the null hypothesis that the errors are homoskedastic.
The normality assumption is not crucial if the sample size n is large, because due to the law of large numbers and the central limit theorem the OLS estimators will still be approximately normally distributed around the true parameter values. However, heteroskedasticity will render the t-values and p-values invalid, and the OLS estimation method inefficient. The latter means that in the case of heteroskedasticity there exists a better method to estimate the parameters, in the sense that it is possible to estimate the parameters by an alternative method such that the variances of the alternative estimators will be lower than the variances of the corresponding OLS estimators. Which alternative estimation method would be better depends on what is known about the conditional variance E[u_{j}^{2} | x_{1,j},...,x_{k,j}].
If the Breusch-Pagan test rejects the homoskedasticity assumption, it is possible to correct the t-values and p-values for the effect of heteroskedasticity, as shown by White (1980).
Finally, EasyReg reports the asymptotic standard variance matrix and the asymptotic HC variance matrix of the OLS estimators.
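A minimal sketch of White's heteroskedasticity-consistent (HC) covariance matrix, the sandwich formula (X'X)⁻¹ X'diag(e²)X (X'X)⁻¹ (illustrative code under simulated heteroskedastic errors, not EasyReg's implementation):

```python
import numpy as np

def white_hc_cov(X, resid):
    """White's HC covariance matrix of the OLS estimator:
    (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}."""
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = X.T @ (X * resid[:, None] ** 2)
    return XtX_inv @ meat @ XtX_inv

# Simulate errors whose variance depends on the regressor.
rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
u = rng.normal(size=n) * (1.0 + x ** 2)       # heteroskedastic errors
y = 1.0 + 0.5 * x + u
X = np.column_stack([x, np.ones(n)])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
hc_se = np.sqrt(np.diag(white_hc_cov(X, resid)))   # HC standard errors
hc_t = b / hc_se                                    # HC t-values
```

The HC standard errors remain valid whether or not the errors are homoskedastic, at the cost of some efficiency when they are.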
Q = A·L^{α}·K^{β}·e^{u},
where Q is output, L is labor input, K is capital input and u is the error term. Taking logarithms yields the linear regression model
lnQ = lnA + α·lnL + β·lnK + u,
where lnA is the intercept. Now suppose that you want to estimate this model under the restriction of homogeneity of degree 1, i.e., if both K and L increase by, say, 10% then so will Q. This condition is equivalent to:
α + β = 1.
Thus, replacing α with 1 - β yields
lnQ = lnA + (1 - β)lnL + β·lnK + u.
This model can be reformulated and estimated as an unrestricted linear regression model, as follows:
Y = lnQ - lnL = lnA + β(lnK - lnL) + u = β_{0} + β_{1}X + u,
say, where Y = lnQ - lnL, X = lnK - lnL, β_{0} = lnA and β_{1} = β.
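The restricted estimation above can be sketched on simulated Cobb-Douglas data (all variable names and parameter values here are illustrative assumptions):

```python
import numpy as np

# Simulate a Cobb-Douglas technology with a + b = 1 (here b = 0.3),
# then recover b from the transformed regression
#   lnQ - lnL = lnA + b*(lnK - lnL) + u.
rng = np.random.default_rng(4)
n = 100
lnL = rng.normal(size=n)
lnK = rng.normal(size=n)
lnA_true, b_true = 0.5, 0.3
lnQ = lnA_true + (1 - b_true) * lnL + b_true * lnK + 0.05 * rng.normal(size=n)

Y = lnQ - lnL                                  # transformed dependent variable
Xmat = np.column_stack([lnK - lnL, np.ones(n)])
coef, *_ = np.linalg.lstsq(Xmat, Y, rcond=None)
b_hat, lnA_hat = coef[0], coef[1]
a_hat = 1.0 - b_hat                            # restriction a + b = 1
```

The restriction is imposed exactly by construction: the two estimated exponents sum to one.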
In the time series case the regression model takes the form
y_{t} = β_{1}x_{1,t} + ... + β_{k-1}x_{k-1,t} + β_{k} + u_{t}, t = 1,2,...,n,
where the index t now represents time. However, this is not the only difference with the classical linear regression model; there are a few crucial differences. The first one is that the model variables are no longer independent across the observations t, so that it is necessary to restrict the dependence:
Denoting by ℑ_{t} the information set
ℑ_{t} = {x_{1,t},...,x_{k,t}, y_{t-1}, x_{1,t-1},...,x_{k,t-1}, y_{t-2}, ...}
of current regressors and all past observations, we must have that E[u_{t} | ℑ_{t}] = 0.
In addition to the tests just mentioned, in the time series regression case EasyReg also reports the value of the Durbin-Watson (DW) test for first-order autocorrelation of the errors u_{t}. The alternative hypothesis of this test is that
u_{t} = ρu_{t-1} + ε_{t} for some ρ satisfying 0 < |ρ| < 1,
where now E[ε_{t} | ℑ_{t}] = 0 and E[ε_{t}^{2} | ℑ_{t}] = σ^{2}, and the null hypothesis is that ρ = 0. Under the null hypothesis the DW statistic should be close to 2. The DW test is one of the few tests for which EasyReg does not have built-in critical values. Thus, in order to conduct the DW test you have to look up the critical values in an econometrics textbook.
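The DW statistic itself is easy to compute from the residuals (a sketch; the statistic is approximately 2(1 - ρ̂), hence close to 2 under the null):

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); close to 2 when the
    residuals are serially uncorrelated, roughly 2*(1 - rho) otherwise."""
    d = np.diff(resid)
    return float((d @ d) / (resid @ resid))

rng = np.random.default_rng(5)
e = rng.normal(size=5000)               # serially uncorrelated errors
dw = durbin_watson(e)                   # should be close to 2

u = np.zeros(5000)                      # AR(1) errors with rho = 0.5
for t in range(1, 5000):
    u[t] = 0.5 * u[t - 1] + rng.normal()
dw_ar = durbin_watson(u)                # roughly 2*(1 - 0.5) = 1
```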
Note that the DW test is only valid if the model does not contain lagged dependent variables, i.e., none of the regressors x_{i,t} is a lagged dependent variable y_{t-j}.
The stationarity hypothesis has to be tested separately for each of the time series in the model.
A typical time series regression model with lagged dependent variable x_{1,t} = y_{t-1} takes the form
y_{t} = β_{1}y_{t-1} + β_{2}x_{2,t} + ... + β_{k-1}x_{k-1,t} + β_{k} + u_{t}.
If β_{1} = 1, then y_{t} is a unit root process, even if the other model variables are stationary.
LN[unemployment] = β_{1}LN[unemployment]_{-1} + β_{2}(LN[real GNP]_{-1} - LN[real GNP]_{-2}) + β_{3}(LN[real wage]_{-1} - LN[real wage]_{-2}) + β_{4} + u,
where the negative subscripts indicate the lags.
Double click LN[real GNP] and LN[real wage], and click the "Selection OK" button. Then the window changes to:
If you click "O.K." the transformations involved are added to the data file, and you will jump back to the first window. The "Cancel" button in the first transformation window has now become the "Done" button. Click it. Then you will jump back to the EasyReg main window.
Double click the variables LN[unemployment], DIF1[LN[real GNP]], and DIF1[LN[real wage]], and click "Selection OK".
We are not going to select a subset of observations. Thus, click "No" and then click "Continue".
Double click the dependent variable, LN[unemployment], and click "Continue".
This window is only for your information. The only action required is to click "Continue".
EasyReg has automatically selected the other variables as the independent variables. Click "Selection OK".
Now we have to select the lagged dependent and (lagged) independent variables, in the next three windows. These windows only appear if you have declared your data as time series.
We are now done with the selection of the lagged dependent and independent variables. Click "Selection OK".
EasyReg automatically adds the constant 1 for the intercept to the model. Click "Continue".
This window only appears if you have declared your data as time series data. I will assume that the text on the button "I have no idea what you are talking about!" applies to you. Thus, click it.
This is the last step of the model specification. Click "Continue". Then the estimation results will be computed:
If you click "Continue" the module NEXTMENU will be activated with options for further analysis, including the default option to store the output in file OUTPUT.TXT in the EASYREG.DAT subfolder.
However, if you click "Done" the output will not be written to file OUTPUT.TXT. Therefore, click "Done" only if you have made a mistake in specifying the model.
Thus, click "Continue":
Dependent variable:
Y = LN[unemployment]

Characteristics: LN[unemployment]
First available observation = 31(=1890)
Last available observation = 129(=1988)
First chosen observation = 51(=1910)
Last chosen observation = 129(=1988)
Number of usable chosen observations: 79
Subsample characteristics:
Minimum value: 1.8232000E-001
Maximum value: 3.2148700E+000
Sample mean: 1.7394800E+000

X variables:
X(1) = LAG1[LN[unemployment]]
X(2) = LAG1[DIF1[LN[real GNP]]]
X(3) = LAG1[DIF1[LN[real wage]]]
X(4) = 1

Model:
Y = b(1)X(1) +.....+ b(4)X(4) + U,
where U is the error term, satisfying E[U|X(1),...,X(4)] = 0.

OLS estimation results
Parameters  Estimate    t-value    H.C. t-value
            (S.E.)      (H.C. S.E.)
            [p-value]   [H.C. p-value]
b(1)     0.7423636     10.907     10.147
        (0.06806)     (0.07316)
        [0.00000]     [0.00000]
b(2)    -3.1431192     -3.293     -2.591
        (0.95457)     (1.21303)
        [0.00099]     [0.00957]
b(3)     1.3967263      0.892      0.745
        (1.56506)     (1.87592)
        [0.37216]     [0.45654]
b(4)     0.5178157      3.911      3.368
        (0.13239)     (0.15374)
        [0.00009]     [0.00076]

Notes:
1: S.E. = Standard error
2: H.C. = Heteroskedasticity Consistent. These t-values and standard errors are based on White's heteroskedasticity consistent variance matrix.
3: The two-sided p-values are based on the normal approximation.

Effective sample size (n): 78
Variance of the residuals: 0.13976107
Standard error of the residuals (SER): 0.37384631
Residual sum of squares (RSS): 10.34231885
(Also called SSR = Sum of Squared Residuals)
Total sum of squares (TSS): 31.68915115
R-square: 0.6736
Adjusted R-square: 0.6604

Overall F test: F(3,74) = 50.91
p-value = 0.00000
Significance levels: 10% 5%
Critical values: 2.16 2.73
Conclusions: reject reject

Test for first-order autocorrelation:
Durbin-Watson test = 1.883484
WARNING: Since the model contains a lagged dependent variable, the Durbin-Watson test is NOT valid!
REMARK: A better way of testing for autocorrelation is to specify AR errors and then test the null hypothesis that the AR parameters are zero.
Jarque-Bera/Salmon-Kiefer test = 8.159389
Null hypothesis: The errors are normally distributed
Null distribution: Chi-square(2)
p-value = 0.01691
Significance levels: 10% 5%
Critical values: 4.61 5.99
Conclusions: reject reject

Breusch-Pagan test = 3.876958
Null hypothesis: The errors are homoskedastic
Null distribution: Chi-square(3)
p-value = 0.27506
Significance levels: 10% 5%
Critical values: 6.25 7.81
Conclusions: accept accept

Information criteria:
Akaike: -1.917900621
Hannan-Quinn: -1.869519398
Schwarz: -1.797043758

If the model is correctly specified, in the sense that the conditional expectation of the model error U relative to the X variables and all lagged dependent (Y) variables and lagged X variables equals zero, then the OLS parameter estimators b(1),..,b(4), minus their true values, times the square root of the sample size n, are (asymptotically) jointly normally distributed with zero mean vector and variance matrix:

 3.61359955E-01   5.23653940E-01   7.22206358E-01  -6.55265904E-01
 5.23653940E-01   7.10732618E+01  -7.25338223E+01  -1.86985499E+00
 7.22206358E-01  -7.25338223E+01   1.91054973E+02  -2.09074028E+00
-6.55265904E-01  -1.86985499E+00  -2.09074028E+00   1.36703566E+00

provided that the conditional variance of the model error U is constant (U is homoskedastic), or

 4.17455356E-01   1.10368137E+00   9.37306154E-01  -8.37497515E-01
 1.10368137E+00   1.14772086E+02  -1.22726118E+02  -3.67345075E+00
 9.37306154E-01  -1.22726118E+02   2.74486770E+02  -1.74631925E+00
-8.37497515E-01  -3.67345075E+00  -1.74631925E+00   1.84366741E+00

if the conditional variance of the model error U is not constant (U is heteroskedastic).
As you see, the Breusch-Pagan test accepts the homoskedasticity condition at the 10% significance level. Therefore, I will ignore the HC t-values and p-values, and judge the significance of the parameters by the values of their standard t-values and p-values.
The parameter estimate b(3) is not significantly different from zero, at any conventional significance level. This indicates that the variable X(3) = LAG1[DIF1[LN[real wage]]] does not matter for LN[unemployment].
The parameter estimate b(2) = -3.1431192 is significantly different from zero if tested two-sided, and the negative sign is what you would expect. It is also easy to test the significance of this parameter by a left-sided test, on the basis of the corresponding (standard) p-value 0.00099. Recall that this two-sided p-value is computed as P(|Z| > 3.293), where Z is standard normally distributed, so that the left-sided p-value is half the two-sided one: P(Z < -3.293) ≈ 0.0005.
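This p-value arithmetic can be checked directly under the normal approximation (a sketch using only the standard library):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

t = -3.293                                  # t-value of b(2) from the output
two_sided = 2.0 * (1.0 - norm_cdf(abs(t)))  # matches the reported 0.00099
left_sided = norm_cdf(t)                    # half the two-sided p-value
```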
The parameter b(2) can be interpreted as an elasticity, due to the fact that the derivative of ln(x) is 1/x: if the percentage growth rate of real GNP,
100·(real GNP - real GNP_{-1}) / real GNP_{-1},
increases by, say, 1 percentage point, then in the next period unemployment will decrease by approximately -b(2)% = 3.1431192%, ceteris paribus (= everything else being equal or constant). Note that the latter decrease is relative rather than absolute in unemployment rate points. Thus, the estimated effect of a 1 percentage point increase in the real GNP percentage growth rate in period t - 1,
100·(real GNP_{t-1} - real GNP_{t-2}) / real GNP_{t-2} - 100·(real GNP_{t-2} - real GNP_{t-3}) / real GNP_{t-3} = 1,
on unemployment in period t is:
100·(unemployment_{t} - unemployment_{t-1}) / unemployment_{t-1} = b(2) = -3.1431192.
However, due to the presence of the lagged dependent variable there is also an effect on unemployment in period t + j for j = 1,2,3,... If the above change in the real GNP growth rate is a "once and for all" change, i.e., the growth rate of real GNP remains constant after period t - 1, then the estimated effect on unemployment in period t + j is
100·(unemployment_{t+j} - unemployment_{t+j-1}) / unemployment_{t+j-1} = b(2)·b(1)^{j},
hence the estimated long-run effect of a once and for all change of the real GNP growth rate with 1 percentage point is:
Σ_{j} 100·(unemployment_{t+j} - unemployment_{t+j-1}) / unemployment_{t+j-1} = b(2) / (1 - b(1)) ≈ -12.2%,
where the summation is taken over j = 0,1,2,...
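These dynamic-multiplier calculations can be reproduced directly from the reported estimates b(1) = 0.7423636 and b(2) = -3.1431192:

```python
# Effect of a once-and-for-all 1-point increase in the growth rate:
# the impact in period t+j is b2 * b1**j, so the long-run effect is
# the geometric sum b2 / (1 - b1).
b1, b2 = 0.7423636, -3.1431192                # b(1), b(2) from the output

impact = [b2 * b1 ** j for j in range(5)]     # effects in periods t, t+1, ...
long_run = b2 / (1.0 - b1)                    # approximately -12.2 percent
```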
Once you have estimated the model by OLS, you will have a variety of options for further inference. The options menu opens if you click the "Options" button in the previous window:
In this guided tour I will only focus on those options that are covered by intermediate econometrics textbooks. Thus, I will not discuss the KVB test, the ICM test, and the kernel estimate of the error density.
Double click b(2) and b(3), and click "Test joint significance". Then the test results appear:
These test results are also written to the output file.
Next, let us test the hypothesis that the true values β_{2} and β_{3} of the parameters of X(2) and X(3) satisfy a given linear restriction. The linear restriction involved has to be entered in the form of its coefficients and its right-hand side value.
Click "No more restrictions". Then the test results appear.
The "Back" button brings you back to the "What to do next?" window.
An ARMA(1,1) error process takes the form
u_{t} = ρu_{t-1} + ε_{t} - δε_{t-1},
and an AR(1) = ARMA(1,0) process has the form
u_{t} = ρu_{t-1} + ε_{t},
where E[ε_{t} | ℑ_{t}] = 0 and E[ε_{t}^{2} | ℑ_{t}] = σ^{2}.
In this example I will re-estimate the model with AR(1) errors, in order to test for first-order autocorrelation as an alternative to the Durbin-Watson test.
The coefficient a(1,1) corresponds to ρ and the coefficient a(2,1) corresponds to δ. Since we are going to specify an AR(1) process, double click the coefficient a(1,1).
The RSS will now be minimized by the simplex method of Nelder and Mead. Click "Start SIMPLEX iteration".
It is recommended to restart the simplex iteration until the parameters do not change anymore. Thus, leave "Auto restart" checked and click "Restart SIMPLEX iteration".
Then click "Done with SIMPLEX iteration".
Click "Continue". Then the estimation results appear.
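The simplex search can be sketched with `scipy.optimize.minimize` (Nelder-Mead) applied to the RSS of a model with AR(1) errors. The quasi-differenced objective below is my illustration of the idea on simulated data, not EasyReg's exact objective function:

```python
import numpy as np
from scipy.optimize import minimize

# Simulate y_t = 1 + 0.5 x_t + u_t with AR(1) errors u_t = 0.6 u_{t-1} + e_t.
rng = np.random.default_rng(6)
n = 300
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.normal()
y = 1.0 + 0.5 * x + u

def rss(params):
    """RSS of the quasi-differenced model:
    e_t = (y_t - rho*y_{t-1}) - b0*(1 - rho) - b1*(x_t - rho*x_{t-1})."""
    b0, b1, rho = params
    e = (y[1:] - rho * y[:-1]) - b0 * (1 - rho) - b1 * (x[1:] - rho * x[:-1])
    return float(e @ e)

res = minimize(rss, x0=[0.0, 0.0, 0.0], method="Nelder-Mead")
res = minimize(rss, x0=res.x, method="Nelder-Mead")   # restart, as recommended
b0_hat, b1_hat, rho_hat = res.x
```

Restarting the simplex from the previous optimum, as the tour recommends, guards against premature convergence of the Nelder-Mead polytope.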
Note that the AR(1) parameter estimate is not significantly different from zero, so that there is no evidence of first-order autocorrelation.
In the case of ARCH(1) errors the conditional variance of u_{t} takes the form
σ_{t}^{2} = E[u_{t}^{2} | ℑ_{t}] = α_{0} + α_{1}u_{t-1}^{2},
which can be written as an AR(1) model for u_{t}^{2}:
u_{t}^{2} = α_{0} + α_{1}u_{t-1}^{2} + v_{t},
where E[v_{t} | ℑ_{t}] = 0. Moreover, in the case of GARCH(1,1) errors the conditional variance of u_{t} takes the form
σ_{t}^{2} = γ_{1}σ_{t-1}^{2} + α_{0} + α_{1}u_{t-1}^{2}.
Note that (G)ARCH is quite different from the alternative hypothesis of the Breusch-Pagan test. Therefore, if the Breusch-Pagan test accepts the null hypothesis of homoskedasticity then this does not imply absence of (G)ARCH.
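EasyReg's own ARCH test may differ in detail; as a sketch of the idea, Engle's LM test for ARCH(1) exploits the AR(1) representation of u_{t}² above by regressing the squared residuals on their own lag (`scipy` assumed available):

```python
import numpy as np
from scipy.stats import chi2

def arch_lm_test(resid):
    """Engle's LM test for ARCH(1): regress e_t^2 on a constant and
    e_{t-1}^2; n * R^2 is asymptotically chi-square(1) under no ARCH."""
    e2 = resid ** 2
    y = e2[1:]
    X = np.column_stack([e2[:-1], np.ones(len(e2) - 1)])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    fit_resid = y - X @ b
    r2 = 1.0 - float(fit_resid @ fit_resid) / float(((y - y.mean()) ** 2).sum())
    stat = len(y) * r2
    return stat, chi2.sf(stat, 1)

# Homoskedastic errors: no ARCH structure in the simulated residuals.
rng = np.random.default_rng(7)
stat, p = arch_lm_test(rng.normal(size=1000))
```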
In this example I will re-estimate the OLS model with ARCH(1) errors.
Test of homoskedasticity against ARCH
However, before you re-estimate the model, it is better to test for ARCH errors first:
The null hypothesis of no ARCH has been accepted.
Re-estimate the model with (G)ARCH errors
Nevertheless, I will re-estimate the model with ARCH(1) errors, in order to show how to do that.
The parameter r(1,1) corresponds to the GARCH(1,1) parameter γ_{1}, and the parameter r(2,1) corresponds to the ARCH parameter α_{1}.
The parameter d is the estimate of α_{0} and the parameter r(2,1) is the estimate of α_{1}. Since r(2,1) is not significant at any conventional significance level, we may conclude that α_{1} = 0, hence
σ_{t}^{2} = E[u_{t}^{2} | ℑ_{t}] = α_{0}
is constant. Thus, there is no ARCH(1). However, similar to the ARMA error case all further options will now involve the model with ARCH(1) errors. Therefore, in order to undo the re-estimation, you have to estimate the model again by OLS.
Before doing this, let us look at the option "Plot the GARCH variances":
This is the plot of the estimated conditional variances σ_{t}^{2} = E[u_{t}^{2} | ℑ_{t}]. Since we have accepted the null hypothesis of no ARCH, one would expect to see a horizontal straight line. However, the estimated coefficient r(2,1) is quite large: r(2,1) = 0.344673, so that what you see here is the plot of the function
σ_{t}^{2} = d + r(2,1)·u_{t-1}^{2} = 0.093678 + 0.344673·u_{t-1}^{2}.
The option "Plot the fit" does not need much explanation. The only points worth mentioning are that if you move the mouse pointer over the picture, the corresponding date is displayed in the title bar of the window, and if you double click on the picture the corresponding date is printed.