
Basic Econometrics Welcome! - Karlstad University



Lecture 1: Introduction

Basic Econometrics

Iris Wang

iris.wang@kau.se

Welcome!

This is the first lecture of the course ”Econometrics, 7.5 hp” (STGA02 & NEGB22).

Textbook

• Gujarati, D. N. (2003) Basic Econometrics (fifth edition), McGraw-Hill.

Course structure

• 11 lectures and 4 computer classes, which you work on jointly with your group.

• SPSS

• It’s learning: www.its.kau.se

• The course will be assessed as follows:

– Lab/Two assignments: 1.5 hp/ECTS (only G)

– Written exam: 6 hp/ECTS

– Examination: max 20 points (G ≥ 10p, VG ≥ 15p)

Examination

• The overall assessment is based on a written exam.

• The written exam is a closed-book exam. Statistical tables and a formula sheet will be supplied. The maximum score on the written exam is 20.

• The grade on each lab/assignment is Pass, hence the total overall mark on the labs/2 assignments is Pass (the labs are not compulsory, and solutions have to be handed in on time for points to be awarded).

• The final marks are set according to the following principles:

– High Pass: A total score (labs/assignments + exam) higher than or equal to 15.

– Pass: A total score (exam) higher than or equal to 10 and lower than 15.

Chapter 1:

The Nature and Scope of Econometrics


What is Econometrics?

• The measurement of economic relationships

• “the application of mathematical statistics to economic data to lend empirical support to models constructed by mathematical economics and to obtain numerical estimates” (Samuelson et al., Econometrica, 1954)

• Aims of econometric modelling:

– explanation

– policy evaluation

– forecasting

Econometrics is used for:

– Estimating economic relationships

– Testing economic theories

– Evaluating & implementing policy

– Forecasting

For example:

What is the effect of education on wages?

How do training programs impact productivity?

Types of data

All empirical analysis requires data. We will now discuss a few different structures of data that you may come across if you do empirical analysis in economics:

1. Cross-sectional data

2. Time series data

3. Pooled cross sections

4. Panel (or longitudinal) data

1. Cross-sectional data

• A cross-sectional dataset consists of a sample of individuals, households, firms, … taken at a given point in time

• Cross-sectional datasets are often obtained from random sampling from the underlying population.

• If the sample has not been drawn randomly, our methods may have to be adjusted. For now, we assume random sampling unless I say otherwise.

Cross-sectional data (cont’d)

• The dataset stored in the file WAGE1.SAV is a cross-sectional dataset.

Illustration:

• Binary/dummy variables (yes = 1, no = 0).

• Observation number: in cross-sectional datasets, the ordering of observations doesn’t matter.

2. Time series data

• A time series dataset consists of observations on one or several variables over time.

• Unlike the arrangement of cross-sectional data, the chronological ordering of observations in a time series is important.

• A key feature of time series data that makes them more difficult to analyze than cross-sectional data is that observations are unlikely to be independent over time.

• Special methodological problems arise when we analyze time series data.


A time series dataset

3. Pooled cross-sections

• Some datasets have both cross-sectional and time series features.

• Example: household surveys from 1985 and 1990 which are combined to yield one dataset containing observations from both years.

• Pooled cross-sections may be a useful basis for analyzing the effect of a policy change. We often include time (year) as an additional explanatory variable in regressions based on pooled cross-sections.

3. Pooled cross-sections (cont’d)

Table 1: Two Years of Housing Prices

4. Panel (longitudinal) data

• A panel dataset consists of a time series for each cross-sectional member in the data set. Example:

• Key feature: the same cross-sectional units are followed over time.

• This is the big difference compared to a pooled cross-section.

• Several advantages, e.g.

a. Enables the researcher to control for certain unobserved characteristics of individuals (firms, cities…) that might otherwise lead to problems.

b. Can analyze dynamics.

Brief summary

• Key ingredient: Data – typically in the form of large samples.

• Data = information.

1. Cross-sectional data—are data on one or more variables collected at one point in time.

2. Time series data—are collected over a period of time.

3. Pooled data—a combination of time series and cross-section.

Panel (longitudinal) data is a special type of pooled data, in which the same cross-sectional unit, say, a family or firm, is surveyed over time.

• Econometrics = a method for processing data and learning about general patterns in the population of interest.

• For example, what is the effect of education on labor market outcomes in the US?
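As a rough illustration of the data structures summarized above, the sketch below stores two tiny made-up datasets as Python lists of records and checks what separates a panel from a pooled cross-section. All variable names and values are hypothetical; only the layout matters.

```python
# Hypothetical toy records; every value is made up to show the layout only.

# Pooled cross-sections: a fresh sample of households is drawn in each year.
pooled = [
    {"year": 1985, "household": "A", "income": 21.0},
    {"year": 1985, "household": "B", "income": 34.5},
    {"year": 1990, "household": "C", "income": 25.0},
    {"year": 1990, "household": "D", "income": 40.1},
]

# Panel: the same firms are followed over time.
panel = [
    {"year": 1985, "firm": "X", "sales": 10.0},
    {"year": 1990, "firm": "X", "sales": 12.5},
    {"year": 1985, "firm": "Y", "sales": 8.0},
    {"year": 1990, "firm": "Y", "sales": 7.5},
]


def units_by_year(rows, unit_key):
    """Map each year to the set of units observed in that year."""
    result = {}
    for row in rows:
        result.setdefault(row["year"], set()).add(row[unit_key])
    return result


def is_panel(rows, unit_key):
    """True when every year contains exactly the same set of units."""
    sets = list(units_by_year(rows, unit_key).values())
    return all(s == sets[0] for s in sets)


print(is_panel(panel, "firm"), is_panel(pooled, "household"))
```

A single year of either list is a cross-section, and the records for one firm in `panel` form a (very short) time series.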

Example:

• Data: wage1.sav

• These data were originally obtained from the 1976 Current Population Survey in the US.

• You can obtain the data from the course website. I very much encourage you to do so.

• First let’s look at summary statistics for this sample.

• To this end we will use a program called SPSS.


SPSS version 18 looks like this:

And I can generate summary statistics via Analyze → Descriptive Statistics → Descriptives.

Here is the summary statistics table from the SPSS output window:
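For readers without SPSS at hand, the same kind of table can be approximated with Python's standard library. This is only a sketch: the numbers below are hypothetical and are not taken from wage1.sav, and the helper `describe` is an illustrative name, not part of any package.

```python
import statistics

# Hypothetical hourly wages and years of education (NOT the WAGE1 data).
wage = [3.10, 3.24, 3.00, 6.00, 5.30, 8.75, 11.25, 5.00]
educ = [11, 12, 11, 8, 12, 16, 18, 12]


def describe(values):
    """Return N, minimum, maximum, mean and sample standard deviation."""
    return {
        "N": len(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Mean": statistics.mean(values),
        "Std. Deviation": statistics.stdev(values),  # divides by N - 1
    }


summary = {"wage": describe(wage), "educ": describe(educ)}
for name, stats in summary.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```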

Specifying the Mathematical Model of labor market outcomes in the US

To see the relationship between ”education” and ”wage”, the first thing we should do is to plot the data for these two variables in a scatter diagram, or scattergram.

Plot scattergram via SPSS

Here is a graph showing the relationship between wages and education in the data:

(Try to generate this graph yourself. Hint: as I showed on the previous slide.)

What do we learn from the graph?

• Broad pattern: High levels of education are associated with high wages.

• But there are lots of individual exceptions!

• Wages vary a lot across individuals with the same level of education.

• The latter finding is particularly pronounced among people with high levels of education.

(each point in the graph shows wage & education for a particular individual)


Capturing the ”broad pattern” in the data by means of regression

• The previous graph shows that high wages go hand in hand with high levels of education – at least on average.

• Regression is a statistical technique enabling us (under certain assumptions that we will study carefully later on) to quantify by how much the average – or expected - wage increases as education increases by (say) 1 year.

Education and Earnings

• Suppose you want to evaluate the effects of years of education on worker earnings.

• Do you need a theory?

• A plausible economic model:

wage = f ( educ, exper)

where

wage = hourly wage

educ = years of formal education

exper = years of work experience

An econometric model

wage = β0 + β1·education + ε

where:

wage = average hourly earnings;

education = years of education.

An econometric model

• WAGE is the dependent variable in the model.

– In econometrics, the dependent variable is almost always to the left of the equal sign (’on the left-hand side’).

– The dependent variable is the outcome of interest, to be explained by other variables.

• Education is the explanatory (or independent) variable in the model.

– Explanatory variables are typically written to the right of the equal sign (’on the right-hand side’).

An econometric model

• We observe the dependent and the explanatory variables – i.e. we have data on these.

• Econometric jargon: ”we regress wage on education” – this means wage is the dependent variable.

The simplest wage model you will ever see…

• Consider the following mathematical model of wages:

wage_i = β0 + β1·education_i + ε_i

• Variables: wage and education. These are observed in the data. We know values of wage and education for each individual.

• Parameters: β0 and β1. These are unobserved – their values are unknown. They are constant across individuals (in this model)

• The error term: εi. This is unobserved – we don’t know the values.


OLS Estimation by SPSS

• Under a set of assumptions that we will study carefully throughout this course, we can estimate the unknown model parameters β0 and β1 using the wage1.sav dataset

• The two most important assumptions:

– the expected value of the error term is zero

– the error term is uncorrelated with education

• Why we need these assumptions will become clearer later on.

• I obtain OLS results in SPSS using Analyze → Regression → Linear

Regression results: Simple wage model

These are the OLS estimates of the parameters β0 and β1.

• The constants β0, β1 are parameters (or coefficients) of the econometric model.

– β0 is often referred to as the intercept. Unless you have a compelling reason not to, you should always include an intercept in your models.

– β1 is a constant describing the direction and strength of the relationship between wage and the explanatory variable in the model. It is sometimes referred to as the slope coefficient.

– The parameters of the model are typically unknown. Under certain assumptions, we can estimate these parameters using econometric methods.

Finally:

• ε is an error term (or a disturbance term) containing unobserved factors.

– Dealing with the error term ε is a very important component of econometric analysis.

– (By the end of this course, you will understand what this statement means in practice.)


Steps in empirical economic analysis

• Econometric methods are used in virtually every branch of applied economics

• An empirical analysis uses data to test a theory or to estimate a relationship

• The first step in any empirical analysis: formulating the question of interest

– Testing a theory

– Evaluating a policy

– etc.

Econometrics & Economics

• Econometrics should always be linked to economic reasoning

• Exactly how formal and ’tight’ this link is varies across studies

• Researchers setting out to test an economic theory usually begin by developing a formal economic model – by which I mean a set of mathematical equations that describe various relationships

• Once an econometric model has been specified, various hypotheses of interest can be stated in terms of the unknown parameters.

• For example, we may hypothesize that education has no effect on wage; in the context of our model, this would be equivalent to hypothesizing that β1=0. Right?

Causality

• A common goal for applied economists is to estimate the causal effect of one variable on some outcome of interest.

• Ceteris paribus: other relevant factors being equal, what is the effect of…

– a price increase on consumer demand

– training on worker productivity

Causality (cont’d)

If…

a) …we succeed in holding all other relevant determinants of earnings (say, years of working experience) constant; and

b) …find a link between years of education and earnings,

…then we can conclude that education has a causal effect on earnings.

Causality (cont’d)

• Ideal setting is experimental: laboratory – administer treatment to half the sample and use the other half as control.

• Much of the research economists do uses non-experimental data

• A key challenge in econometrics is to condition on enough other factors, so that a case for causality can be made.


Causality: Example

• Suppose we want to estimate the causal effect of education on wages

• We’ve already seen that the two variables are positively correlated in the WAGE1 data set:

Causality: Example

• This of course doesn’t imply that education causes higher wages

• Wages are determined by many factors other than education – for example, innate ability

– High ability => high wages

– High ability => high education (e.g. intelligent individuals choose high education)

• Perhaps the correlation between education and wages visible in the graph is driven by ability rather than education?

• To credibly estimate the causal effect of education, we must find a way of determining the link between education and wages holding innate ability constant!
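The ability story above can be mimicked in a small simulation. The sketch below is purely illustrative: the data-generating process, the assumed true effect (`TRUE_EFFECT`), and all coefficients are made up, and "holding ability constant" is approximated crudely by keeping only individuals with nearly identical ability.

```python
import random

random.seed(0)

TRUE_EFFECT = 0.5   # assumed causal effect of one year of education on wage
n = 5000

ability = [random.gauss(0, 1) for _ in range(n)]
# High ability => more education.
educ = [12 + 2 * a + random.gauss(0, 1) for a in ability]
# High ability => higher wage, on top of the causal effect of education.
wage = [1 + TRUE_EFFECT * e + 2 * a + random.gauss(0, 1)
        for e, a in zip(educ, ability)]


def ols_slope(x, y):
    """Slope from a simple regression of y on x (with an intercept)."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Naive regression of wage on education, ignoring ability.
naive = ols_slope(educ, wage)

# Crude way of holding ability constant: compare only individuals
# whose ability is nearly the same (within 0.1 of zero).
similar = [(e, w) for e, w, a in zip(educ, wage, ability) if abs(a) < 0.1]
held_constant = ols_slope([e for e, _ in similar], [w for _, w in similar])

print(round(naive, 2), round(held_constant, 2))
```

In this made-up setting the naive slope overstates the assumed effect of education, while the slope among individuals of similar ability lands much closer to it. In real data ability is not observed, which is exactly why the problem is hard.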


Chapters 2&3: The Simple Linear Regression Model

Iris Wang

iris.wang@kau.se

Two-Variable Regression Model (Chapters 2&3)

Suppose we want to ”explain y in terms of x”. Three issues:

1. Since there’s never an exact relationship between two variables: how do we allow for other factors affecting y?

2. What is the functional form?

3. How can we be sure we are capturing a ceteris paribus (causal) relationship between y and x (if that is the goal)?

The simple linear regression model

Assume that, in the population, outcome variable y can be modeled as a function of x as follows:

y = β0 + β1x + u

u: error term; disturbance term; residual; noise

β0, β1: parameters, coefficients, constants

Simple regression: The functional relationship between y and x is linear:

• β1 is the slope parameter – a parameter of primary interest in applied economics

• The intercept parameter β0 (sometimes called the constant term) is rarely central to an analysis.

Examples:

• How do we interpret β1 in these equations?

• How can we hope to learn about the ceteris paribus effect of x (education) on y (wage), holding other factors fixed, when we are ignoring all those other factors?

The regression of Y on X

Population Regression Line (PRL)


[Figure: distribution of Y given X = x3 around its conditional mean; horizontal axis: years of education]

Model: y = β0 + β1x + u

Assumption: E(u|x) = 0

Implication: E(y|x) = β0 + β1x

• E(y|x) is the Population Regression Function (PRF).

• It is a linear function of x.

The Sample Regression Function (SRF)

• Example: p.43

• Sample and population regression lines

Figure 2.5 p.45

Chapter 3: OLS Estimation

• Why is this estimator called the Ordinary Least Squares (OLS) estimator?

• To see why, first define a fitted value for y when x = xi as

ŷi = β̂0 + β̂1xi

• Next, define the residual for observation i as

ûi = yi − ŷi = yi − β̂0 − β̂1xi

Note that there are n such residuals.

• The OLS estimates minimize the sum of squared residuals:

Σ ûi² = Σ (yi − β̂0 − β̂1xi)²

Least squares…

The method of Ordinary Least Squares

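• The OLS estimates β̂0 and β̂1 are chosen so that the sum of squared residuals is as small as possible.

• P.58 (3.1.6) & (3.1.7); for more details see Appendix 3A:

β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

β̂0 = ȳ − β̂1x̄

where x̄ and ȳ are the sample means of x and y.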
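The least-squares idea can be sketched in a few lines of plain Python. The formulas used are the standard simple-regression ones (slope = sample covariation of x and y over sample variation of x; intercept = ȳ minus slope times x̄); the data and the function names `ols_fit` and `ssr` are hypothetical and only serve the illustration.

```python
def ols_fit(x, y):
    """Return (b0, b1), the intercept and slope that minimize the
    sum of squared residuals in a regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def ssr(x, y, b0, b1):
    """Sum of squared residuals for the line b0 + b1 * x."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data: years of education and hourly wage.
educ = [8, 10, 12, 12, 14, 16, 18]
wage = [3.0, 4.5, 5.0, 6.5, 7.0, 9.5, 10.0]

b0, b1 = ols_fit(educ, wage)
best = ssr(educ, wage, b0, b1)

# Nudging either estimate away from the OLS values raises the sum.
print(best < ssr(educ, wage, b0 + 0.1, b1))
print(best < ssr(educ, wage, b0, b1 + 0.1))
```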

Some related concepts…

• The OLS regression line (or, the sample regression function; SRF):

ŷ = β̂0 + β̂1x

• Interpretation:


The statistical properties of OLS estimators

• Three important properties:

1. The OLS estimators are easily computed.

2. They are point estimators (in Chap. 5 we will introduce interval estimators).

3. The OLS estimates are obtained from the sample data and the sample regression line has the following properties:

The statistical properties of OLS estimators

I. The point of sample means (x̄, ȳ) is always on the OLS regression line. P. 59

II.

III.

IV.

V.
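The formulas for items II to V did not survive in these notes. The sketch below checks numerically the properties that textbooks commonly list for an OLS line fitted with an intercept; whether these match the slide's items exactly is an assumption. The data are hypothetical.

```python
def ols_fit(x, y):
    """Intercept and slope from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data: years of education and hourly wage.
educ = [8, 10, 12, 12, 14, 16, 18]
wage = [3.0, 4.5, 5.0, 6.5, 7.0, 9.5, 10.0]

n = len(educ)
x_bar = sum(educ) / n
y_bar = sum(wage) / n
b0, b1 = ols_fit(educ, wage)
fitted = [b0 + b1 * xi for xi in educ]
resid = [yi - fi for yi, fi in zip(wage, fitted)]

# Each entry is the absolute gap that should be zero up to rounding error.
checks = {
    "line passes through (x_bar, y_bar)": abs(b0 + b1 * x_bar - y_bar),
    "mean of fitted values equals mean of y": abs(sum(fitted) / n - y_bar),
    "residuals sum to zero": abs(sum(resid)),
    "residuals uncorrelated with x": abs(
        sum(r * xi for r, xi in zip(resid, educ))),
    "residuals uncorrelated with fitted values": abs(
        sum(r * f for r, f in zip(resid, fitted))),
}

for name, gap in checks.items():
    print(f"{name}: {gap < 1e-9}")
```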

The Assumptions underlying the method of least squares

The Classical Linear Regression Model (CLRM) makes 7/10 assumptions, p.62/p.315.

The meaning of ”linear” regression

• Final point: Linear regression means linear in parameters.

• This would thus be a linear regression model:

y = b0 + b1*sqrt(x) + u

while this would not:

y = 1/(b0 + b1*x) + u

• Nonlinear relationships between y and x can often be allowed for within the linear regression framework (Exercises)
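To make the "linear in parameters" point concrete, the sketch below fits y = b0 + b1*sqrt(x) by ordinary least squares after transforming the regressor. The data are hypothetical and generated without an error term, so the fit should recover the chosen values b0 = 2 and b1 = 3.

```python
import math


def ols_fit(x, y):
    """Intercept and slope from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Hypothetical data generated exactly from y = 2 + 3*sqrt(x) (no error term).
x = [1, 4, 9, 16, 25, 36]
y = [2 + 3 * math.sqrt(xi) for xi in x]

# Transform the regressor; the model stays linear in b0 and b1.
z = [math.sqrt(xi) for xi in x]
b0, b1 = ols_fit(z, y)
print(b0, b1)
```

The relationship between y and x is nonlinear, but because the parameters enter linearly, regressing y on z = sqrt(x) is an ordinary linear regression.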

Assumptions underlying OLS

• Assumption 1: The regression model is linear in parameters: y = β0 + β1x + u.

• Assumption 3: Zero conditional mean - the error ui has an expected value of zero given xi:

E(ui|xi) = 0

In deciding when the simple linear regression is going to produce unbiased estimators, Assumption 3 is crucial.

• Assumption 4: Homoskedasticity. The error u has the same variance given any value of the explanatory variable:

Var(u|x) = σ²

Technical point: Combined with E(u|x)=0, this implies

E(u²|x) = σ²

• If the variance of u depends on x, the error term is said to exhibit heteroskedasticity
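The difference between the two cases can be seen in a small simulation. The sketch below is hypothetical throughout: it draws one set of errors with a constant standard deviation and another whose standard deviation grows with x, then compares the error variance for high and low values of x. The cutoff of 5.5 is an arbitrary midpoint.

```python
import random
import statistics

random.seed(2)

x = [random.uniform(1, 10) for _ in range(4000)]

# Homoskedastic errors: same standard deviation for every x.
u_homo = [random.gauss(0, 2) for _ in x]
# Heteroskedastic errors: standard deviation proportional to x.
u_hetero = [random.gauss(0, 0.4 * xi) for xi in x]


def variance_ratio(x, u, cutoff=5.5):
    """Sample variance of u for x >= cutoff divided by that for x < cutoff."""
    low = [ui for xi, ui in zip(x, u) if xi < cutoff]
    high = [ui for xi, ui in zip(x, u) if xi >= cutoff]
    return statistics.variance(high) / statistics.variance(low)


ratio_homo = variance_ratio(x, u_homo)
ratio_hetero = variance_ratio(x, u_hetero)
print(round(ratio_homo, 2), round(ratio_hetero, 2))
```

A ratio near 1 is what homoskedasticity looks like; a ratio far above 1 signals that the spread of the error grows with x.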


[Figures: homoskedasticity; heteroskedasticity]

• Assumption 7: There is sample variation in the explanatory variable (the x are not all the same).

3.3 Variances of the OLS estimates

• We need to keep in mind that β̂0 and β̂1 are not the population parameters β0, β1

• The estimates are based on a random sample.

• Different random samples would give rise to different estimates β̂0 and β̂1.

• Therefore, β̂0 and β̂1 follow distributions.

We are interested in the means and the variances of these.

Variance of the OLS estimators

• In addition to knowing that the sampling distribution of the OLS estimator β̂1 is centered on β1, it is important to know how far we can expect β̂1 to be away from β1 on average.

• Of course, we want to use an estimator that is expected to be as close to β1 as possible, on average.

So we need to measure the spread (dispersion) of β̂1. Key measure: the variance.

Sampling variance of the OLS estimator

• Under the assumptions of OLS,

Var(β̂1) = σ² / Σ (xi − x̄)²

• This shows that a low error variance σ² and a high degree of variability in the explanatory variable x contribute to a low variance of β̂1.
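The sampling variance of the slope can be illustrated with a small Monte Carlo experiment. The sketch below assumes the standard simple-regression result Var(β̂1) = σ² / Σ(xi − x̄)² under homoskedastic errors; the population values `BETA0`, `BETA1`, `SIGMA` and the regressor values are made up.

```python
import random
import statistics

random.seed(1)

BETA0, BETA1, SIGMA = 1.0, 0.5, 2.0     # assumed population values
x = [8, 10, 12, 12, 14, 16, 18] * 5     # regressor values, held fixed

x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)
theoretical_var = SIGMA ** 2 / sxx


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x."""
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


# Draw many samples from the same population and re-estimate the slope.
estimates = []
for _ in range(4000):
    y = [BETA0 + BETA1 * xi + random.gauss(0, SIGMA) for xi in x]
    estimates.append(ols_slope(x, y))

mean_estimate = statistics.mean(estimates)
simulated_var = statistics.variance(estimates)
print(round(mean_estimate, 3), round(simulated_var, 5),
      round(theoretical_var, 5))
```

The average of the slope estimates sits close to the assumed `BETA1`, and their variance across samples is close to the formula value. Raising `SIGMA` or shrinking the spread of `x` makes the estimates more dispersed.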


Estimating the variance

• Note that σ² is an unknown parameter.

• If we want to estimate the variance of β̂1 we need to estimate σ². This is done based on the OLS residuals.

• Once we have estimated the variance, we can construct confidence intervals and derive test statistics etc. that will enable us to do inference. We come back to this in Chapter 5.
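The slide's own formula is not recoverable here. For reference, the standard textbook estimator in the simple regression model with an intercept divides the residual sum of squares by n − 2, because two parameters have been estimated. A sketch in LaTeX, with û_i denoting the OLS residuals:

```latex
\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} \hat{u}_i^{2}}{n-2} = \frac{RSS}{n-2},
\qquad
\widehat{\operatorname{Var}}(\hat{\beta}_1)
  = \frac{\hat{\sigma}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\operatorname{se}(\hat{\beta}_1)
  = \sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_1)}
```

The square root of the estimated error variance is what SPSS reports as "Std. Error of the Estimate".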

3.4 The Gauss-Markov Theorem

• Theorem: Under the assumptions of the CLRM, OLS is the Best Linear Unbiased Estimator (BLUE) of the population parameters. P.72

• Best = smallest variance

• It’s reassuring to know that, under these assumptions, you cannot find a better linear unbiased estimator than OLS.

• If one or several of these assumptions fail, OLS is no longer BLUE.

3.5 Goodness of Fit

Breaking y into two parts (SRF)

Interpretation: y = systematic part of y + unsystematic part of y

Towards a goodness of fit measure

• For each i, write

yi = ŷi + ûi

• Thus, we can view OLS as decomposing yi into two parts:

– A fitted value ŷi

– And a residual ûi

where the fitted values and residuals are uncorrelated in the sample.

TSS, ESS and RSS

• TSS = Total Sum of Squares

• This is simply the total sample variation in yi:

TSS = Σ (yi − ȳ)²


TSS, ESS and RSS

• ESS = Explained Sum of Squares

• This is the sample variation in ŷi:

ESS = Σ (ŷi − ȳ)²

• RSS = Residual Sum of Squares

• This is the sample variation in the OLS residual ûi:

RSS = Σ ûi²

• Recall the OLS decomposition introduced earlier:

yi = ŷi + ûi

Using the earlier result that the fitted values and residuals are uncorrelated in the sample, we can show:

TSS = ESS + RSS

R² = ESS/TSS = 1 − RSS/TSS

• This is the R-squared of the regression, sometimes called the coefficient of determination.

How do we interpret the R-squared?

• Answer: As the fraction of the sample variation in y that is explained by x.

• Note: R-sq is always between 0 and 1.

• Extreme cases: R-sq = 0; R-sq = 1. In words, what do these cases mean?

Additional points on R-sq

• Low R-squareds in regressions are not uncommon. In particular, if we are working with noisy cross-sectional micro data, we often get R-sq lower than (say) 0.20.

• In time series econometrics, we often get high R-squareds. Even simple models like y(t) = a + b*y(t-1) + u tend to give high R-sq.

• For forecasting models, having a good fit (high R-sq) is of course very central.

• But maximizing the R-sq is not a goal in most empirical studies.

• We, therefore, do not need to put too much weight on the size of the R-squared when evaluating regression equations.

TSS, ESS, RSS in the SPSS regression output

• These quantities are primarily used to calculate a measure of the goodness-of-fit of our model.

ANOVA (b)

| Model 1 | Sum of Squares | df | Mean Square | F | Sig. | Annotation |
| Regression | 1175.711 | 1 | 1175.711 | 102.879 | .000 (a) | ESS |
| Residual | 5976.870 | 523 | 11.428 | | | RSS; mean square = RSS/(n−2) |
| Total | 7152.581 | 524 | | | | TSS |

a. Predictors: (Constant), educ

b. Dependent Variable: wage

R-squared in the SPSS regression output

Model Summary

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
| 1 | .405 (a) | .164 | .163 | 3.38054 |

a. Predictors: (Constant), educ
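The Model Summary figures can be recomputed by hand from the sums of squares in the ANOVA table above. The sketch below copies the three sums of squares and the degrees of freedom from the SPSS output (with decimal commas written as points) and applies the usual definitions; the variable names are illustrative.

```python
import math

# Values copied from the SPSS ANOVA table (decimal commas -> points).
ess = 1175.711     # Regression (explained) sum of squares
rss = 5976.870     # Residual sum of squares
tss = 7152.581     # Total sum of squares
df_resid = 523     # n - 2
df_total = 524     # n - 1

r_squared = ess / tss
adj_r_squared = 1 - (rss / df_resid) / (tss / df_total)
std_error = math.sqrt(rss / df_resid)   # "Std. Error of the Estimate"
f_stat = (ess / 1) / (rss / df_resid)   # regression df is 1

print(round(r_squared, 3), round(adj_r_squared, 3),
      round(std_error, 5), round(f_stat, 3))
```

Up to rounding, these reproduce R Square, Adjusted R Square, Std. Error of the Estimate and F as reported by SPSS. The residual mean square divides RSS by n − 2 (here 523).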


Units of measurement

• General point: Changing the units of measurement does not change the interpretation of the results (but of course it may change the estimates)

• It is very important to be clear on the units of measurement – otherwise we can’t interpret the results!!

• Does the R-squared change as a result of changing the units of measurement?
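One way to explore the R-squared question is to rescale the dependent variable and refit. The sketch below uses hypothetical data, with wage measured first in dollars and then in cents; the helper names are illustrative.

```python
def ols_fit(x, y):
    """Intercept and slope from a simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def r_squared(x, y):
    """R-squared = 1 - RSS/TSS for a simple OLS regression of y on x."""
    b0, b1 = ols_fit(x, y)
    y_bar = sum(y) / len(y)
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - rss / tss


# Hypothetical data: years of education and hourly wage in dollars.
educ = [8, 10, 12, 12, 14, 16, 18]
wage_dollars = [3.0, 4.5, 5.0, 6.5, 7.0, 9.5, 10.0]
wage_cents = [100 * w for w in wage_dollars]

b0_d, b1_d = ols_fit(educ, wage_dollars)
b0_c, b1_c = ols_fit(educ, wage_cents)

print(b1_c / b1_d)   # how the slope changes with the unit
print(r_squared(educ, wage_dollars), r_squared(educ, wage_cents))
```

In this example the intercept and slope are both multiplied by 100, so the estimates change but their interpretation does not, while the R-squared stays the same.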

Keep in mind

• Unbiasedness is a feature of the sampling distributions of the OLS estimators.

• Says nothing about the estimate that we get for a given sample.

• If the sample we obtain is somehow ”typical”, our estimates should be ”near” the population values.

• But we will never know for sure.

How can unbiasedness fail?

• Unbiasedness fails if any of the assumptions fails (linearity, random sampling, variation in x, zero conditional mean).

• In practice, Assumption 3 is the most important one.

• The possibility that x is correlated with u is very often a concern in simple regression analysis with nonexperimental data (e.g. recall discussion about wage and education; omitted variables correlated with education)

THANK YOU!