Lecture on Cross Section and Panel Data


Research Method

Lecture 11-3 (Ch15)

Instrumental Variables Estimation and Two Stage Least Squares

IV solution to errors-in-variables problems: Example 1

Consider the following model.

Y=β0+β1x1*+β2x2+u

where x1* is the correctly measured variable. Suppose, however, that you only have the error-ridden variable x1=x1*+e1. Thus, the actual estimation model becomes

Y=β0+β1x1+β2x2+(u-β1e1)

Since x1 contains e1, x1 is correlated with the error term (u-β1e1). Thus, the OLS estimate of β1 is biased. This is the errors-in-variables bias.


The errors-in-variables bias cannot be corrected with the panel data method, but the IV method can solve the problem.

Suppose that you have another measure of x1*. Call this z1. For example, suppose x1* is the husband’s annual salary, and x1 is the annual salary reported by the husband, which is reported with error. Sometimes the survey also asks the wife to report her husband’s annual salary. Then z1 is the husband’s annual salary as reported by the wife.


In this case, z1=x1*+a1 where a1 is the measurement error.

Although z1 is measured with error, it can serve as an instrument for x1. Why? First, x1 and z1 should be correlated, since both measure x1*. Second, since e1 and a1 are just measurement errors, they are unlikely to be correlated with each other (or with u), which means that z1 is uncorrelated with the error term (u-β1e1) in

Y=β0+β1x1+β2x2+(u-β1e1)

So, 2SLS with z1 as an instrument can eliminate this bias.
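The argument can be checked on simulated data. The following is a minimal Stata sketch to be run as a do-file; it does not use the lecture's data. The variable names, the seed, the sample size, and the true coefficients (β1=2) are all assumptions chosen for illustration. Because x1* and e1 are given equal variances, OLS should recover only about half of β1, while 2SLS with z1 as the instrument should give an estimate close to 2.

```stata
* Simulated errors-in-variables example (all names and values are hypothetical)
clear
set seed 12345
set obs 10000

gen xstar = rnormal()                 // correctly measured variable x1* (unobserved in practice)
gen x2    = rnormal()                 // second regressor, measured without error
gen u     = rnormal()                 // structural error
gen e1    = rnormal()                 // measurement error in x1
gen a1    = rnormal()                 // measurement error in z1, independent of e1 and u

gen x1 = xstar + e1                   // error-ridden measure used as the regressor
gen z1 = xstar + a1                   // second error-ridden measure used as the instrument
gen y  = 1 + 2*xstar + 0.5*x2 + u     // true beta1 = 2

regress y x1 x2                       // OLS: coefficient on x1 is attenuated toward zero
scalar b_ols = _b[x1]

ivregress 2sls y x2 (x1 = z1)         // 2SLS: z1 instruments x1
scalar b_iv = _b[x1]

display "OLS beta1 = " b_ols "   2SLS beta1 = " b_iv
```

The design choice worth noting is that a1 is drawn independently of e1 and u; if the two reports shared a common error component, z1 would no longer be a valid instrument.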


IV solution to errors-in-variables problems: Example 2

This is a more complicated example. Consider the following model.

log(wage)=β0+β1educ+abil+u

where we have the unobserved ability problem.

Suppose that you have two test scores that are indicators of ability.

test1 = γ1abil+e1

test2 = δ1abil+e2

If you use test1 as the proxy variable for ability, you have the following model.

log(wage)=β0+β1educ+α1test1+(u-α1e1)

where α1=1/γ1. Thus, test1 is correlated with the error term: it has the errors-in-variables problem. In this case, a simple plug-in solution does not work.

However, since you have test2, another measure of abil, you can use test2 as an instrument for test1 in the 2SLS procedure to eliminate the bias.
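The plug-in equation follows from solving the first test-score equation for abil. The short derivation below is a sketch. It writes γ1 for the loading of abil in test1 and α1 for its inverse, and it assumes that γ1 is nonzero and that ability enters the wage equation with a unit coefficient.

```latex
\begin{align*}
\text{test}_1 &= \gamma_1\,\text{abil} + e_1
  \;\Longrightarrow\;
  \text{abil} = \frac{1}{\gamma_1}\,(\text{test}_1 - e_1), \\
\log(\text{wage}) &= \beta_0 + \beta_1\,\text{educ} + \text{abil} + u \\
  &= \beta_0 + \beta_1\,\text{educ} + \alpha_1\,\text{test}_1 + (u - \alpha_1 e_1),
  \qquad \alpha_1 = \frac{1}{\gamma_1}.
\end{align*}
```

Because test1 contains e1, it is correlated with the composite error u-α1e1. test2 depends on abil and e2 only, so if e2 is uncorrelated with e1 and u, test2 is correlated with test1 but not with the composite error.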

Exercise

Using WAGE2.dta, consider a log-wage regression with explanatory variables educ, exper, tenure, married, south, urban, and black. Using IQ and KWW (knowledge of the world of work) as two measures of the unobserved ability, estimate the model that corrects for the bias in educ.


OLS

. reg lwage educ exper tenure married south urban black

      Source |       SS       df       MS              Number of obs =     935
-------------+------------------------------           F(  7,   927) =   44.75
       Model |  41.8377619     7  5.97682312           Prob > F      =  0.0000
    Residual |  123.818521   927  .133569063           R-squared     =  0.2526
-------------+------------------------------           Adj R-squared =  0.2469
       Total |  165.656283   934  .177362188           Root MSE      =  .36547

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0654307   .0062504    10.47   0.000     .0531642    .0776973
       exper |    .014043   .0031852     4.41   0.000      .007792     .020294
      tenure |   .0117473    .002453     4.79   0.000     .0069333    .0165613
     married |   .1994171   .0390502     5.11   0.000     .1227801     .276054
       south |  -.0909036   .0262485    -3.46   0.001     -.142417   -.0393903
       urban |   .1839121   .0269583     6.82   0.000     .1310056    .2368185
       black |  -.1883499   .0376666    -5.00   0.000    -.2622717   -.1144281
       _cons |   5.395497    .113225    47.65   0.000      5.17329    5.617704
------------------------------------------------------------------------------

Simple plug-in solution

. reg lwage IQ educ exper tenure married south urban black

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          IQ |   .0035591   .0009918     3.59   0.000     .0016127    .0055056
        educ |   .0544106   .0069285     7.85   0.000     .0408133     .068008
       exper |   .0141458   .0031651     4.47   0.000     .0079342    .0203575
      tenure |   .0113951   .0024394     4.67   0.000     .0066077    .0161825
     married |   .1997644   .0388025     5.15   0.000     .1236134    .2759154
       south |  -.0801695   .0262529    -3.05   0.002    -.1316916   -.0286473
       urban |   .1819463   .0267929     6.79   0.000     .1293645    .2345281
       black |  -.1431253   .0394925    -3.62   0.000    -.2206304   -.0656202
       _cons |   5.176439   .1280006    40.44   0.000     4.925234    5.427644
------------------------------------------------------------------------------

Plug-in + IV, using KWW as the instrument for IQ

. ivregress 2sls lwage educ exper tenure married south urban black (IQ=KWW)

Instrumental variables (2SLS) regression               Number of obs =     935
                                                       Wald chi2(8)  =  298.58
                                                       Prob > chi2   =  0.0000
                                                       R-squared     =  0.1900
                                                       Root MSE      =  .37884

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          IQ |   .0130473   .0049103     2.66   0.008     .0034234    .0226712
        educ |   .0250321   .0165266     1.51   0.130    -.0073595    .0574238
       exper |     .01442   .0033047     4.36   0.000     .0079429    .0208972
      tenure |   .0104562   .0025887     4.04   0.000     .0053824      .01553
     married |   .2006903   .0404813     4.96   0.000     .1213485    .2800322
       south |  -.0515532   .0309777    -1.66   0.096    -.1122685     .009162
       urban |   .1767058   .0280756     6.29   0.000     .1216785     .231733
       black |  -.0225612   .0736029    -0.31   0.759    -.1668202    .1216979
       _cons |   4.592453    .324209    14.17   0.000     3.957015    5.227891
------------------------------------------------------------------------------
Instrumented:  IQ
Instruments:   educ exper tenure married south urban black KWW

2SLS with heteroskedasticity

When heteroskedasticity is present, we have to modify the standard error formula.

The derivation of the formula is beyond the scope of this class. However, Stata computes it automatically. Just use the robust option.
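As a hedged illustration of the robust option, the sketch below simulates data whose error variance depends on the instrument and estimates the same 2SLS model with and without robust standard errors. The variable names, the seed, and the coefficients are assumptions, not taken from the lecture; vce(robust) is the long form of the robust option. The coefficient estimates are identical in the two runs; only the standard errors change.

```stata
* Conventional vs. heteroskedasticity-robust standard errors in 2SLS
* (simulated data; all names and values are hypothetical)
clear
set seed 2024
set obs 5000

gen z = rnormal()                          // instrument
gen v = rnormal()                          // reduced-form error
gen x = z + v                              // endogenous regressor
gen u = (0.5*v + rnormal())*exp(0.5*z)     // error: correlated with x, variance depends on z
gen y = 1 + 2*x + u

ivregress 2sls y (x = z)                   // conventional standard errors
scalar b_plain  = _b[x]
scalar se_plain = _se[x]

ivregress 2sls y (x = z), vce(robust)      // robust standard errors
scalar b_rob  = _b[x]
scalar se_rob = _se[x]

display "coef = " b_plain "   conventional SE = " se_plain "   robust SE = " se_rob
```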


Testing overidentifying restrictions

Usually, the instrument exogeneity cannot be tested.

However, when you have extra instruments, you can effectively test this. This is the test of overidentifying restrictions.


The basic idea behind the test of overidentifying restrictions

Before presenting the procedure, I will provide you with the basic idea of the test.

Consider the following model.

y1=β0+β1y2+β2z1+β3z2+u1

Suppose you have two instruments for y2: z3 and z4. If both instruments are valid, using either z3 or z4 as an instrument will produce consistent estimates of β1.

Let β̂1 be the IV estimator when z3 is used as an instrument. Let β̃1 be the IV estimator when z4 is used as an instrument.

The idea is to check if β̂1 and β̃1 are similar. That is, you test H0: β̂1-β̃1=0.

If you reject this null, it means that either z3 or z4, or both of them, are not exogenous. We do not know which one is not exogenous. So the rejection of the null typically means that your choice of instruments is invalid.

On the other hand, if you fail to reject the null hypothesis, we can have some confidence in the overall set of instruments used.

However, caution is necessary. Even if you fail to reject the null, this does not always mean that the set of instruments is valid.

For example, consider a wage regression with education being the endogenous variable, and suppose you have mother’s and father’s education as instruments.

Even if mother’s and father’s education do not satisfy the instrument exogeneity, β̂1-β̃1 may be very close to zero since the directions of the biases are the same. In this case, even if they are invalid instruments, we may fail to reject the null (i.e., erroneously judge that they satisfy the instrument exogeneity).

The procedure of the test of overidentifying restrictions

The procedure:

(i) Estimate the structural equation by 2SLS and obtain the 2SLS residuals, û1.

(ii) Regress û1 on all exogenous variables. Obtain the R-squared, say R1².

(iii) Under the null that all IVs are uncorrelated with the structural error u1, nR1² ~ χ²(q), where q is the number of extra instruments.

If you fail to reject the null (i.e., if nR1² is small), then you have some confidence about the instrument exogeneity. If you reject it, at least some of the instruments are not exogenous.

The nR1² statistic is valid when the homoskedasticity assumption holds. The nR1² statistic is also called Sargan’s statistic.

When we allow for heteroskedasticity, we have to use another statistic, called Hansen’s J statistic.

Both tests can be done automatically using STATA.
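The three steps of the procedure can be sketched in Stata on simulated data. This is an illustration under stated assumptions: the data-generating process, the seed, and the variable names (chosen to mirror y1, y2, z1 to z4 in the model) are hypothetical, and both instruments are valid by construction. The sketch also stores the Sargan statistic that estat overid returns in r(sargan), so that the manual nR² can be compared with it.

```stata
* Manual test of overidentifying restrictions vs. estat overid
* (simulated data; all names and values are hypothetical)
clear
set seed 42
set obs 2000

gen z1 = rnormal()                     // exogenous variables in the structural equation
gen z2 = rnormal()
gen z3 = rnormal()                     // two instruments for y2, so q = 1 extra instrument
gen z4 = rnormal()
gen v2 = rnormal()
gen u1 = 0.6*v2 + rnormal()            // structural error, correlated with y2 through v2
gen y2 = 0.5*z1 + 0.5*z2 + 0.7*z3 + 0.7*z4 + v2
gen y1 = 1 + 1.5*y2 + 0.8*z1 - 0.3*z2 + u1

* (i) 2SLS and its residuals
ivregress 2sls y1 z1 z2 (y2 = z3 z4)
predict double uhat, residuals
estat overid                           // built-in test, run while 2SLS results are active
scalar sargan_builtin = r(sargan)

* (ii) regress the residuals on all exogenous variables
regress uhat z1 z2 z3 z4

* (iii) n times R-squared, compared with chi-squared(q), q = 1
scalar nR2 = e(N)*e(r2)
display "nR2 = " nR2 "   p-value = " chi2tail(1, nR2)
```

Using scalars with e(N) and e(r2) avoids the rounding in the displayed R-squared and does not require creating a new variable.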


Exercise

Consider the following model.

Log(wage)=β0+β1(educ)+β2Exper+β3Exper²+u

1. Using Mroz.dta, estimate the above equation using motheduc and fatheduc as instruments for (educ).

2. Test the overidentifying restrictions.


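Answer 1

OLS

. reg lwage educ exper expersq

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .1074896   .0141465     7.60   0.000     .0796837    .1352956
       exper |   .0415665   .0131752     3.15   0.002     .0156697    .0674633
     expersq |  -.0008112   .0003932    -2.06   0.040    -.0015841   -.0000382
       _cons |  -.5220406   .1986321    -2.63   0.009    -.9124667   -.1316144
------------------------------------------------------------------------------

2SLS

. ivregress 2sls lwage exper expersq (educ=motheduc fatheduc)

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0613966   .0312895     1.96   0.050     .0000704    .1227228
       exper |   .0441704   .0133696     3.30   0.001     .0179665    .0703742
     expersq |   -.000899   .0003998    -2.25   0.025    -.0016826   -.0001154
       _cons |   .0481003    .398453     0.12   0.904    -.7328532    .8290538
------------------------------------------------------------------------------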

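Answer 2

First, conduct the test “manually”.

1. First, estimate 2SLS.

. ivregress 2sls lwage exper expersq (educ=motheduc fatheduc)

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0613966   .0312895     1.96   0.050     .0000704    .1227228
       exper |   .0441704   .0133696     3.30   0.001     .0179665    .0703742
     expersq |   -.000899   .0003998    -2.25   0.025    -.0016826   -.0001154
       _cons |   .0481003    .398453     0.12   0.904    -.7328532    .8290538
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   exper expersq motheduc fatheduc

2. Second, generate the 2SLS residual. Call this uhat.

. predict uhat, resid
(325 missing values generated)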


3. Third, regress uhat on all the exogenous variables. Don’t forget to include the exogenous variables in the structural equation: exper and expersq.

. reg uhat exper expersq motheduc fatheduc

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  4,   423) =    0.09
       Model |  .170503136     4  .042625784           Prob > F      =  0.9845
    Residual |   192.84951   423  .455909007           R-squared     =  0.0009
-------------+------------------------------           Adj R-squared = -0.0086
       Total |  193.020013   427  .452037502           Root MSE      =  .67521

------------------------------------------------------------------------------
        uhat |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |  -.0000183   .0133291    -0.00   0.999    -.0262179    .0261813
     expersq |   7.34e-07   .0003985     0.00   0.999    -.0007825     .000784
    motheduc |  -.0066065   .0118864    -0.56   0.579    -.0299704    .0167573
    fatheduc |   .0057823   .0111786     0.52   0.605    -.0161902    .0277547
       _cons |   .0109641   .1412571     0.08   0.938    -.2666892    .2886173
------------------------------------------------------------------------------

4. Fourth, get the R-squared from this regression. You can use the value in the output above, but it is rounded. To compute it more precisely, type this.

. gen rsq=e(r2)

. su rsq

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         rsq |       753    .0008833           0   .0008833   .0008833

5. Finally, compute NR².

. gen n_rsq=428*rsq

. su n_rsq

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
       n_rsq |       753    .3780714           0   .3780714   .3780714

The mean of n_rsq, .3780714, is the NR² statistic. It is also called Sargan’s statistic.

Under the null, the NR² statistic follows χ²(1). The degrees of freedom equal the number of extra instruments. In our case it is 1. (In our model, there is only one endogenous variable, so you need only one instrument. But we have two instruments. Therefore the number of extra instruments is 1.)

Since the 5% cutoff point for χ²(1) is 3.84 and NR² is only 0.378, we fail to reject the null hypothesis that the instruments are not correlated with the structural error.

Thus, we have some confidence in the choice of instruments. In other words, our instruments have ‘passed’ the test of overidentifying restrictions.
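The cutoff point and the p-value can be computed with Stata's chi-squared functions. This is a small sketch, to be run as a do-file, using the NR² value .3780714 obtained in the manual calculation; invchi2tail() returns the upper-tail critical value and chi2tail() returns the upper-tail probability.

```stata
* Chi-squared cutoff and p-value for the overidentification test (q = 1)
display invchi2tail(1, 0.05)      // 5% critical value of chi-squared(1), about 3.84
display chi2tail(1, .3780714)     // p-value of NR2 = .3780714, about 0.54
```

Since the p-value is far above 0.05, the conclusion is the same as the one based on the cutoff point.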


Now, let us conduct the test of overidentifying restrictions automatically.

. estat overid

  Tests of overidentifying restrictions:

  Sargan (score) chi2(1) =  .378071  (p = 0.5386)
  Basmann chi2(1)        =  .373985  (p = 0.5408)

The Sargan (score) value is the NR² statistic. It is also called Sargan’s statistic.

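The heteroskedasticity-robust version can also be done automatically.

Use the robust option when estimating 2SLS.

. ivregress 2sls lwage exper expersq (educ=motheduc fatheduc), robust

------------------------------------------------------------------------------
             |               Robust
       lwage |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0613966   .0331824     1.85   0.064    -.0036397     .126433
       exper |   .0441704   .0154736     2.85   0.004     .0138428     .074498
     expersq |   -.000899   .0004281    -2.10   0.036     -.001738     -.00006
       _cons |   .0481003   .4277846     0.11   0.910    -.7903421    .8865427
------------------------------------------------------------------------------
Instrumented:  educ
Instruments:   exper expersq motheduc fatheduc

Then type the same command.

. estat overid

  Test of overidentifying restrictions:

  Score chi2(1)          =  .443461  (p = 0.5055)

The heteroskedasticity-robust version of the test is usually called Hansen’s J statistic. Note that the score statistic Stata reports after ivregress 2sls with the robust option is Wooldridge’s robust score test; Stata reports Hansen’s J itself after ivregress gmm.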

Note

Even if you fail to reject the null hypothesis in the test, there is a possibility that your instruments are still invalid.

Thus, even if your instruments “pass the test”, in general, you should try to provide a plausible “story” for why your instruments satisfy the instrument exogeneity. (Quarter of birth is a good example.)

Testing the endogeneity

Consider again the following model.

y1=β0+β1y2+β2z1+β3z2+u1

where y2 is the suspected endogenous variable and you have instruments z3 and z4.

If y2 is actually exogenous, OLS is better, because it is more efficient than 2SLS. If you have valid instruments, you can test if y2 is exogenous or not.


Before laying out the procedure, let us understand the basic idea behind the test.

Structural eq: y1=β0+β1y2+β2z1+β3z2+u1

Reduced eq: y2=π0+π1z1+π2z2+π3z3+π4z4+v2

Since z1 through z4 are exogenous, you can check that y2 is correlated with u1 only if v2 is correlated with u1.

Further, let u1=δv2+e1. Then u1 and v2 are correlated only if δ≠0. Thus, consider

y1=β0+β1y2+β2z1+β3z2+δv2+e1

then test if δ is zero or not.

The test of endogeneity: procedure

(i) Estimate the reduced form equation using OLS.

y2=π0+π1z1+π2z2+π3z3+π4z4+v2

Then obtain the residual v̂2.

(ii) Add v̂2 to the structural equation and estimate using OLS.

y1=β0+β1y2+β2z1+β3z2+αv̂2+e1

Then, test H0: α=0. If we reject H0, then we conclude that y2 is endogenous because u1 and v2 are correlated.
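The two steps can be sketched in Stata on simulated data in which y2 is endogenous by construction. Everything in the sketch is an assumption made for illustration: the data-generating process, the seed, and the variable names (chosen to mirror y1, y2, z1 to z4). It also runs 2SLS and estat endogenous for comparison; the coefficient on y2 in the augmented OLS regression is numerically the same as the 2SLS estimate.

```stata
* Regression-based test of endogeneity vs. estat endogenous
* (simulated data; all names and values are hypothetical)
clear
set seed 7
set obs 2000

gen z1 = rnormal()
gen z2 = rnormal()
gen z3 = rnormal()                     // instruments for y2
gen z4 = rnormal()
gen v2 = rnormal()
gen u1 = 0.6*v2 + rnormal()            // delta = 0.6, so y2 is endogenous
gen y2 = 0.5*z1 + 0.5*z2 + 0.7*z3 + 0.7*z4 + v2
gen y1 = 1 + 1.5*y2 + 0.8*z1 - 0.3*z2 + u1

* (i) reduced form by OLS, then the residual
regress y2 z1 z2 z3 z4
predict double v2hat, residuals

* (ii) add the residual to the structural equation and test its coefficient
regress y1 y2 z1 z2 v2hat
scalar b_aug = _b[y2]
test v2hat                             // H0: alpha = 0 (y2 is exogenous)
scalar F_aug = r(F)
scalar p_aug = r(p)

* comparison: 2SLS and the built-in test
ivregress 2sls y1 z1 z2 (y2 = z3 z4)
scalar b_iv = _b[y2]
estat endogenous
scalar F_wu = r(wu)                    // Wu-Hausman F statistic

display "augmented OLS b = " b_aug "   2SLS b = " b_iv
display "F on v2hat = " F_aug "   Wu-Hausman F = " F_wu
```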

Exercise

Consider the following model.

Log(wage)=β0+β1(educ)+β2Exper+β3Exper²+u

Suppose that father’s and mother’s education satisfy the instrument exogeneity. Conduct the Hausman test of endogeneity to check if (educ) is exogenous or not.


Answer

First, conduct the test “manually”.

To use the same observations as 2SLS, run 2SLS once and generate this variable.

. gen fullsample=e(sample)

Now run the reduced form regression, then get the residual.

. reg educ exper expersq motheduc fatheduc if fullsample==1

      Source |       SS       df       MS              Number of obs =     428
-------------+------------------------------           F(  4,   423) =   28.36
       Model |  471.620998     4   117.90525           Prob > F      =  0.0000
    Residual |  1758.57526   423  4.15738833           R-squared     =  0.2115
-------------+------------------------------           Adj R-squared =  0.2040
       Total |  2230.19626   427  5.22294206           Root MSE      =   2.039

------------------------------------------------------------------------------
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       exper |   .0452254   .0402507     1.12   0.262    -.0338909    .1243417
     expersq |  -.0010091   .0012033    -0.84   0.402    -.0033744    .0013562
    motheduc |    .157597   .0358941     4.39   0.000      .087044    .2281501
    fatheduc |   .1895484   .0337565     5.62   0.000     .1231971    .2558997
       _cons |    9.10264   .4265614    21.34   0.000     8.264196    9.941084
------------------------------------------------------------------------------

. predict uhat_reduced, resid

. reg lwage educ exper expersq uhat_reduced

------------------------------------------------------------------------------
       lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .0613966   .0309849     1.98   0.048      .000493    .1223003
       exper |   .0441704   .0132394     3.34   0.001     .0181471    .0701937
     expersq |   -.000899   .0003959    -2.27   0.024    -.0016772   -.0001208
uhat_reduced |   .0581666   .0348073     1.67   0.095    -.0102501    .1265834
       _cons |   .0481003   .3945753     0.12   0.903    -.7274721    .8236727
------------------------------------------------------------------------------

Then check if the coefficient on uhat_reduced is different from zero.

The coefficient on uhat_reduced is significant at the 10% level. Thus, you reject the null hypothesis that educ is exogenous (not correlated with the structural error) at the 10% level.

This is moderate evidence that educ is endogenous, and thus 2SLS should be reported (along with OLS).


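Stata conducts the test of endogeneity automatically. Stata reports the Durbin and Wu-Hausman versions of the test; the Wu-Hausman F statistic equals the square of the t statistic on uhat_reduced in the manual regression (1.67² ≈ 2.79).

. estat endog

  Tests of endogeneity
  Ho: variables are exogenous

  Durbin (score) chi2(1)          =  2.80707  (p = 0.0938)
  Wu-Hausman F(1,423)             =  2.79259  (p = 0.0954)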




Note that the test of endogeneity is valid only if the instruments satisfy the instrument exogeneity.

Thus, test the overidentifying restrictions first to check if the instruments satisfy the instrument exogeneity. If instruments “pass” the overidentifying test, then conduct the test of endogeneity.


Applying 2SLS to pooled cross sections

When you simply apply 2SLS to the pooled cross section data, there is no new difficulty. You can just apply 2SLS.


Combining panel data method and IV method

Suppose you have two-period panel data. The periods are 1987 and 1988. Consider the following model.

Log(scrap)it=β0+δ0d88t+β1(hrsemp)it+ai+uit

Where (scrap) is the scrap rate. (Hrsemp) is the hours of employee training. You have data


Correlation between ai and (hrsemp)it causes a bias in β1. In the first-differenced model, we difference to remove ai: that is, we estimate

∆Log(scrap)it=δ0+β1∆(hrsemp)it+∆uit…(1)

In some cases, ∆(hrsemp)it and ∆uit can still be correlated. For example, when a firm hires more skilled workers, it may reduce job training.

In this case, the quality of the workers is time varying, so it is not contained in ai, but it is contained in uit. Training falls when worker quality rises, and higher worker quality lowers the scrap rate, so ∆(hrsemp)it and ∆uit may be positively correlated. This would cause the OLS estimate of β1 to be biased upward (bias towards not finding the productivity-enhancing effect of training).

To eliminate the bias, we can apply the IV method to equation (1).

One possible instrument for ∆(hrsemp)it is ∆(Grant)it. (Grant)it is a variable indicating whether the company received a job training grant. Since the grant designation is given at the beginning of 1988, ∆(Grant)it may be uncorrelated with ∆uit. At the same time, it would be correlated with ∆(hrsemp)it. Thus, we can use ∆(Grant)it as an IV for ∆(hrsemp)it.
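The combination of first differencing and IV can be sketched on a simulated two-period panel. This is a hedged illustration, not the JTRAIN data: the firm identifier, the grant assignment rule, the time-varying unobservable q (standing in for worker quality), and all coefficients are assumptions. In this design q lowers both training x and the outcome y, so first-differenced OLS is biased upward, while using the differenced grant as an instrument should give an estimate close to the true value of -0.5.

```stata
* First-differenced model with and without IV (simulated two-period panel;
* all names and values are hypothetical)
clear
set seed 99
set obs 5000                                   // 5000 hypothetical firms
gen id = _n
gen a  = rnormal()                             // firm fixed effect a_i
expand 2                                       // two periods per firm
bysort id: gen year = 1986 + _n                // 1987 and 1988
xtset id year

gen d88   = (year == 1988)
gen grant = d88 & (mod(id, 2) == 0)            // half of the firms get a grant in 1988
gen q     = rnormal()                          // time-varying unobservable (worker quality)
gen x     = 2*grant + a - 0.8*q + rnormal()    // training hours
gen y     = 1 + 0.1*d88 - 0.5*x + a - q + rnormal()   // true beta1 = -0.5

gen dy     = D.y
gen dx     = D.x
gen dgrant = D.grant

regress dy dx                                  // first-differenced OLS: biased upward
scalar b_fd = _b[dx]

ivregress 2sls dy (dx = dgrant)                // first-differenced model + IV
scalar b_fdiv = _b[dx]

xtivreg y (x = grant), fd                      // same estimator in one command
display "FD-OLS = " b_fd "   FD-IV = " b_fdiv
```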


Exercise

Using JTRAIN.dta, estimate the following model.

∆Log(scrap)it=δ0+β1∆(hrsemp)it+∆uit…(1)

Use ∆(Grant)it as an instrument for ∆(hrsemp)it. Use the data for 1987 and 1988 only.


Answer

First estimate it manually.

. tsset fcode year
       panel variable:  fcode (strongly balanced)
        time variable:  year, 1987 to 1989
                delta:  1 unit

. gen dlscrap =lscrap-L.lscrap
(363 missing values generated)

. gen dhrsemp=hrsemp-L.hrsemp
(220 missing values generated)

. gen dgrant=grant-L.grant
(157 missing values generated)

This is the simple first-differenced model.

. reg dlscrap dhrsemp if year<=1988

------------------------------------------------------------------------------
     dlscrap |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     dhrsemp |  -.0076007   .0045112    -1.68   0.099    -.0166984    .0014971
       _cons |  -.1035161    .103736    -1.00   0.324    -.3127197    .1056875
------------------------------------------------------------------------------


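This is the first-differenced model + IV method.

. ivregress 2sls dlscrap (dhrsemp=dgrant) if year<=1988

------------------------------------------------------------------------------
     dlscrap |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     dhrsemp |  -.0141532   .0077369    -1.83   0.067    -.0293171    .0010108
       _cons |  -.0326684    .124098    -0.26   0.792     -.275896    .2105592
------------------------------------------------------------------------------
Instrumented:  dhrsemp
Instruments:   dgrant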

Now, estimate the model automatically.

First-differenced model + IV method.

. tsset fcode year
       panel variable:  fcode (strongly balanced)
        time variable:  year, 1987 to 1989
                delta:  1 unit

. xtivreg lscrap (hrsemp=grant) if year<=1988, fd

First-differenced IV regression
Group variable: fcode                           Number of obs      =        45
Time variable (t): year                         Number of groups   =        45

R-sq:  within  =      .                         Obs per group: min =         1
       between = 0.0016                                        avg =       1.0
       overall = 0.0016                                        max =         1

                                                Wald chi2(1)       =      3.20
corr(u_i, Xb)  = -0.2070                        Prob > chi2        =    0.0737

------------------------------------------------------------------------------
    D.lscrap |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      hrsemp |
         D1. |  -.0141532   .0079147    -1.79   0.074    -.0296658    .0013594
             |
       _cons |  -.0326684   .1269512    -0.26   0.797    -.2814881    .2161513
-------------+----------------------------------------------------------------
     sigma_u |  1.3728352
     sigma_e |  .62904299
         rho |   .8264778   (fraction of variance due to u_i)
------------------------------------------------------------------------------
Instrumented:   hrsemp
Instruments:    grant
