

Angewandte statistische Regression I

Dr. Matteo Tanadini (matteo.tanadini@math.ethz.ch)

Fall Semester 2019 (ETHZ)

4. Vorlesung Angewandte statistische Regression I 1 / 40

Outline

1 Goals

2 Introductory example

3 Errors and residuals

4 Assessing the assumptions about the errors

5 Assessing a real case

6 Residual analysis: Why should I do that?

7 Summary


Section 1

Goals


Goals

In this lecture you will learn ...

which assumptions are being made when fitting a linear model

how to best test these assumptions

why doing a residual analysis is good practice & what the potential gains are


Section 2

Introductory example


Simple example

Given a linear model:

y = β0 + β1 · x1 + β2 · x2 + ... + βp · xp + ε

The inference on the regression coefficients (p-values and CIs) is based on the following assumptions about the errors:

ε ~ N(0, σ²), i.i.d.


Simple example

For simplicity we use a very simple linear model based on simulated data.

set.seed(1)
N <- 20
x <- 1:N
errors <- rnorm(n = N, sd = 7)   # i.i.d. normal errors, sd = 7
y <- x * 2 + errors              # true model: intercept 0, slope 2

plot(x, y, panel.first = grid(lty = 2, lwd = 0.5))
abline(a = 0, b = 2, col = "red")   # true regression line

[Figure: scatter plot of y against x with the true regression line in red]

# abline(lm(y ~ x), col = "red")

y = β0 + β1 · x + ε

β0 = 0

β1 = 2

ε ~ N(0, 7²), i.i.d.
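As a quick sketch (assuming the same seed as above), the fit of this simulated model can be compared against the true values:

```r
# Reproduce the simulated data from above
set.seed(1)
N <- 20
x <- 1:N
y <- x * 2 + rnorm(n = N, sd = 7)

# Fit the simple linear model and compare estimates to the truth
fit <- lm(y ~ x)
coef(fit)      # intercept should be near 0, slope near 2
confint(fit)   # the true values should usually fall inside the CIs
```

With this seed the estimates land close to the true intercept 0 and slope 2, which is the point of the exercise: the fitted line approximates, but does not equal, the true line.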


Assumptions

1 The errors follow a normal distribution

2 The errors' expected value is zero

3 The errors are homoscedastic (constant variance)

4 The errors are independent


Assumptions: Normality

ε ~ N(0, σ²), i.i.d.

Draw the symmetric "bell" shapes

[Figure: scatter plot of y against x (space for sketching)]

Draw the case of asymmetric errors

[Figure: scatter plot of y against x (space for sketching)]

Assumptions: Expected error on zero

ε ~ N(0, σ²), i.i.d.

Draw the errors

[Figure: scatter plot of y against x (space for sketching)]

Same for a non-linear relationship

[Figure: scatter plot of y against x (space for sketching)]

Assumptions: Homoscedasticity

ε ~ N(0, σ²), i.i.d.

Draw the errors

[Figure: scatter plot of y against x (space for sketching)]

Draw heteroscedastic errors

[Figure: scatter plot of y against x (space for sketching)]

Assumptions: Independence

ε ~ N(0, σ²), i.i.d.

Observations must be independent!

If data is somehow grouped (e.g. repeated measures), a corresponding predictor must be included in the model.

Observations can also happen to be correlated in time and space. Extensions of the linear model exist to deal with these cases.
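A minimal sketch of how time correlation shows up in the residuals (the simulated AR(1) process and its parameters are illustrative, not from the lecture):

```r
# Simulate a linear trend with AR(1) errors and inspect the
# autocorrelation function of the residuals.
set.seed(2)
n <- 100
t <- 1:n
ar_errors <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 0.5 * t + ar_errors

fit <- lm(y ~ t)
acf(resid(fit))   # spikes well outside the bands indicate serial correlation
```

Here the linear model itself is correctly specified, yet the inference is invalid because the errors are not independent; this is exactly the case where extensions such as generalised least squares are needed.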


Influential observations

abline(a = 0, b = 2, col = "red")

abline(lm(y ~ x), col = "green", lty = "dashed")

[Figure: y vs x with the true regression line and the estimated regression line]

Large error

[Figure: estimated regression line with and without the new observation]

Influential observations

Extreme x-value

[Figure: estimated regression line with and without the new observation]

Large error AND extreme x-value!!!

[Figure: estimated regression line with and without the new observation]

Influential observations

There is no formal assumption about influential observations.

However, linear models are sensitive to them, and therefore it is good practice to check for them.

"Blindly" removing extreme x-values is bad practice.

Observations that are not clear mistakes should NOT be removed from the analysis.

Methods that can deal with influential observations exist.
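A minimal sketch of such a check with standard influence measures, on simulated data with one constructed influential point (the data and cut-offs are illustrative):

```r
# Simulated data with one extreme x-value that also has a large error
set.seed(3)
x <- c(1:20, 35)                 # observation 21: extreme x-value
y <- 2 * x + rnorm(21, sd = 7)
y[21] <- y[21] + 40              # ...combined with a large error

fit <- lm(y ~ x)
cooks.distance(fit)   # values much larger than the rest deserve a look
hatvalues(fit)        # leverage; maximal at the extreme x-value (obs. 21)
which.max(cooks.distance(fit))
```

Note that a high-leverage point is not automatically harmful; it becomes influential when, as constructed here, it also carries a large error.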


Section 3

Errors and residuals


Errors vs residuals

[Figure: 'Errors' — y vs x with the true regression line]

[Figure: 'Residuals' — y vs x with the estimated regression line]

Errors and residuals

Errors = y_obs − y_true

Residuals = y_obs − ŷ

y_obs = β0 + β1 · x + ε

y_true = β0 + β1 · x, with β0 = 0, β1 = 2

ŷ = β̂0 + β̂1 · x, with β̂0 = −0.25, β̂1 = 2.15


Errors and residuals

Errors and residuals are not the same thing.

Most often only residuals are available.

Residuals are an "approximation" of the errors.

Model checking is based on residuals.

( Errors and residuals differ in some mathematical properties. )
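Because the introductory data are simulated, the true errors are known and can be compared with the residuals directly; a sketch:

```r
# Errors vs residuals on the simulated data from the introduction
set.seed(1)
N <- 20
x <- 1:N
errors <- rnorm(n = N, sd = 7)   # true errors (known only in simulation)
y <- 2 * x + errors

fit <- lm(y ~ x)
res <- resid(fit)                # residuals (what we have in practice)

cor(errors, res)   # strongly correlated, but not identical
sum(res)           # exactly 0 by construction (model contains an intercept)
sum(errors)        # not 0 in general
```

The zero-sum constraint is one of the "mathematical properties" in which residuals differ from errors: least squares forces it on the residuals, while the true errors satisfy it only in expectation.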


Section 4

Assessing the assumptions about the errors


Assessing the assumptions

We look at the "cats" data set and use males only.

[Figure: 'Heart weight vs Body weight (Males only)' — Hwt vs Bwt]

Hwt = β0 + βBwt · Bwt + ε

ε ~ N(0, σ²), i.i.d.

            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     -1.2        1.00     -1.2   2.4e-01
Bwt              4.3        0.34     12.7   3.6e-22
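A sketch of how this fit could be reproduced, assuming the slide uses the cats data from the MASS package (the dataset and subset are an assumption based on the slide title):

```r
# Heart weight vs body weight of adult cats, males only
library(MASS)

cats.M <- subset(cats, Sex == "M")         # males only
fit.cats <- lm(Hwt ~ Bwt, data = cats.M)   # Hwt = β0 + β_Bwt·Bwt + ε
summary(fit.cats)$coefficients             # compare with the table above
```

The estimated slope (≈ 4.3 g of heart weight per kg of body weight) and its very small p-value match the coefficient table shown on the slide.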


Assessing normality

ε ~ N(0, σ²), i.i.d.

[Figure: 'Histogram of the residuals (Cats LM)' — density histogram of the residuals]

Assessing normality

Quantile-quantile plots, QQ-plots for short, make it easier to spot deviations from normality.

[Figure: 'Normal Q-Q Plot' — sample quantiles vs theoretical quantiles]

How much deviation can be "tolerated"?
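In base R, a QQ-plot of the residuals is drawn with qqnorm() and qqline(); a minimal sketch on simulated data:

```r
# QQ-plot of the residuals of a fitted linear model
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 7)
fit <- lm(y ~ x)

qqnorm(resid(fit))   # points should lie close to a straight line
qqline(resid(fit))   # reference line through the quartiles
```

Systematic curvature suggests skewness, while S-shapes suggest heavier or lighter tails than the normal distribution.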


Assessing normality

Let's simulate normally distributed errors and redo the QQ-plots several times.

[Figure: nine 'Normal Q-Q Plot' panels of simulated normally distributed samples]

Is the QQ-plot for the observed residuals "different" from the simulated ones?
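The simulation idea can be sketched as follows (panel layout and sample size are illustrative):

```r
# Nine QQ-plots of fresh normal samples of the same size as the data:
# they show how much "wobble" is normal for perfectly normal errors.
set.seed(1)
n <- 20

op <- par(mfrow = c(3, 3))
for (i in 1:9) {
  z <- rnorm(n)
  qqnorm(z)   # each panel: a fresh normal sample
  qqline(z)
}
par(op)
```

If the QQ-plot of the observed residuals would not stand out among these nine panels, the normality assumption is not contradicted.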


Assessing normality

The {plgraphics} package is helpful to decide how much "deviation" can be tolerated.

[Figure: QQ-plot of the standardised residuals (st.sm.res_Hwt) against theoretical quantiles]

Assessing Error on zero

Are residuals "on zero"? ⇒ Plot them against the predictor.

[Figure: left, 'Heart weight vs Body weight (Males only)' (Hwt vs Bwt); right, residuals vs Bwt]

Assessing Error on zero

Example where this assumption is not fulfilled: A non-linear relationship

[Figure: 'Girth growth' — Girth [cm] vs MeasurementDate_AsDate (Apr-Sep), two panels]

Assessing Error on zero

Example where this assumption is not fulfilled: A non-linear relationship

[Figure: 'Girth growth' — Girth [cm] vs MeasurementDate_AsDate (top) and residuals vs MeasurementDate_AsDate (bottom)]

Assessing Homoscedasticity

Note: back on ”cats” data

Is the variance of the residuals constant?

Plot the absolute residuals against the predictor¹

[Figure: left, sqrt(abs(res.lm.cats.M)) vs Bwt; right, |st.sm.res_Hwt| vs fitted value ('Hwt ~ Bwt')]

¹ Often the absolute residuals are then also square-root-transformed.
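A sketch of this check in base R, again assuming the cats data from the MASS package (variable names are assumptions):

```r
# Homoscedasticity check: square-root-transformed absolute residuals
# against the predictor; look for a systematic trend.
library(MASS)

cats.M <- subset(cats, Sex == "M")
fit.cats <- lm(Hwt ~ Bwt, data = cats.M)

plot(cats.M$Bwt, sqrt(abs(resid(fit.cats))),
     xlab = "Bwt", ylab = "sqrt(|residuals|)")
# An increasing trend would indicate heteroscedastic errors.
```

The square-root transformation dampens the influence of a few large residuals, making a variance trend easier to see.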

Assessing Independence

All design variables MUST be contained in the model (e.g. person, block, ...)

( Time correlation can be tested with the autocorrelation functions (type ?acf or ?pacf in R) )

( Spatial correlation can be tested with variograms (type ?variog after loading the package {geoR} in R) )

By plotting the residuals against time or over space, it can be tested graphically whether the independence assumption is fulfilled.
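A minimal sketch of the graphical check, on simulated independent errors (the data are illustrative):

```r
# Plot residuals in observation (time) order to look for patterns
set.seed(4)
n <- 50
time <- 1:n
y <- 1 + 0.3 * time + rnorm(n)   # independent errors here

fit <- lm(y ~ time)
plot(time, resid(fit), type = "b")
abline(h = 0, lty = 2)
acf(resid(fit))   # spikes outside the bands would indicate serial correlation
```

With independent errors the residuals should scatter randomly around zero; runs of consecutive residuals with the same sign are a warning signal.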


Assessing influential observations

See the file "LM ResidualAnalyis Lab.pdf"


Section 5

Assessing a real case


Assessing a real case

See the file "LM ResidualAnalyis Lab.pdf" for a complete residual analysis (model with several predictors)


Section 6

Residual analysis: Why should I do that?


Why should I do that?

"Why should I perform model diagnostics?"

Mainly for two reasons:

To assess whether the model assumptions are fulfilled (i.e. whether the inference can be trusted)

To get more information out of your data


Why should I do that?

More information: The "Bees" example:

[Figure: abs(resid(fit1)) vs fitted(fit1)]

Why should I do that?

"Again these ... f***ing residuals. Why do we bother?!?"

[Figure: 'Observed' — abs. residuals vs fitted values]

Section 7

Summary


Summary

Residual analysis is essential to:

draw valid inference (p-values and confidence intervals)

better understand the data (insights)

build better models ⇒ draw better conclusions


Summary

4 main plots for residual analysis
- Residuals vs fitted plot: model equation, (constant variance, outliers, influential obs)
- Scale-location plot: constant variance, (outliers, influential obs)
- Residuals vs leverage plot: influential obs
- Normal QQ-plot: normality

Other diagnostic plots
- Residuals vs predictor plots: non-linearities, missing interactions (i.e. model equation), constant variance
- Residuals vs time: time correlation
- Residuals over space: space correlation
- ACF, PACF and variograms: time and space correlation
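For a standard lm fit, the four main plots are produced directly by plot() on the fitted object; a sketch on simulated data:

```r
# The four main diagnostic plots in one call
set.seed(1)
x <- 1:30
y <- 2 * x + rnorm(30, sd = 7)
fit <- lm(y ~ x)

op <- par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted, QQ-plot, scale-location, residuals vs leverage
par(op)
```

By default plot.lm shows exactly these four panels; further plots (e.g. Cook's distance) are available via the which argument.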

