

Angewandte statistische Regression I

Dr. Matteo Tanadini (matteo.tanadini@math.ethz.ch)

Fall Semester 2019 (ETHZ)

4. Vorlesung Angewandte statistische Regression I 1 / 40

Outline

1 Goals

2 Introductory example

3 Errors and residuals

4 Assessing the assumptions about the errors

5 Assessing a real case

6 Residual analysis: Why should I do that?

7 Summary


Section 1

Goals


Goals

In this lecture you will learn ...

which assumptions are being made when fitting a linear model

how to best test these assumptions

why doing a residual analysis is good practice & what the potential gains are


Section 2

Introductory example


Simple example

Given a linear model:

y = β0 + β1 · x1 + β2 · x2 + ... + βp · xp + ε

The inference on the regression coefficients (p-values and CIs) is based on the following assumptions about the errors:

ε ~ N(0, σ²), i.i.d.


Simple example

For simplicity we use a very simple linear model based on simulated data.

set.seed(1)
N <- 20
x <- 1:N
errors <- rnorm(n = N, sd = 7)   # i.i.d. normal errors, sd = 7
y <- x * 2 + errors              # true model: intercept 0, slope 2

plot(x, y, panel.first = grid(lty = 2, lwd = 0.5))
abline(a = 0, b = 2, col = "red")   # true regression line

[Figure: scatter plot of y against x with the true regression line in red]

# abline(lm(y ~ x), col = "red")

y = β0 + β1 · x + ε

β0 = 0

β1 = 2

ε ~ N(0, 7²), i.i.d.
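As a quick sketch (assuming the same seed as above), the fit of this simulated model can be compared against the true values:

```r
# Reproduce the simulated data from above
set.seed(1)
N <- 20
x <- 1:N
y <- x * 2 + rnorm(n = N, sd = 7)

# Fit the simple linear model and compare estimates to the truth
fit <- lm(y ~ x)
coef(fit)      # intercept should be near 0, slope near 2
confint(fit)   # the true values should usually fall inside the CIs
```

With this seed the estimates land close to the true intercept 0 and slope 2, which is the point of the exercise: the fitted line approximates, but does not equal, the true line.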


Assumptions

1 The errors follow a normal distribution

2 The errors' expected value is zero

3 The errors are homoscedastic (constant variance)

4 The errors are independent


Assumptions: Normality

ε ~ N(0, σ²), i.i.d.

Draw the symmetric "bell" shapes

[Figure: scatter plot of y against x (space for sketching)]

Draw the case of asymmetric errors

[Figure: scatter plot of y against x (space for sketching)]

Assumptions: Expected error on zero

ε ~ N(0, σ²), i.i.d.

Draw the errors

[Figure: scatter plot of y against x (space for sketching)]

Same for a non-linear relationship

[Figure: scatter plot of y against x (space for sketching)]

Assumptions: Homoscedasticity

ε ~ N(0, σ²), i.i.d.

Draw the errors

[Figure: scatter plot of y against x (space for sketching)]

Draw heteroscedastic errors

[Figure: scatter plot of y against x (space for sketching)]

Assumptions: Independence

ε ~ N(0, σ²), i.i.d.

Observations must be independent!

If data is somehow grouped (e.g. repeated measures), a corresponding predictor must be included in the model.

Observations can also happen to be correlated in time and space. Extensions of the linear model exist to deal with these cases.
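A minimal sketch of how time correlation shows up in the residuals (the simulated AR(1) process and its parameters are illustrative, not from the lecture):

```r
# Simulate a linear trend with AR(1) errors and inspect the
# autocorrelation function of the residuals.
set.seed(2)
n <- 100
t <- 1:n
ar_errors <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- 0.5 * t + ar_errors

fit <- lm(y ~ t)
acf(resid(fit))   # spikes well outside the bands indicate serial correlation
```

Here the linear model itself is correctly specified, yet the inference is invalid because the errors are not independent; this is exactly the case where extensions such as generalised least squares are needed.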


Influential observations

abline(a = 0, b = 2, col = "red")

abline(lm(y ~ x), col = "green", lty = "dashed")

[Figure: y vs x with the true regression line and the estimated regression line]

Large error

[Figure: estimated regression line with and without the new observation]

Influential observations

Extreme x-value

[Figure: estimated regression line with and without the new observation]

Large error AND extreme x-value!!!

[Figure: estimated regression line with and without the new observation]

Influential observations

There is no formal assumption about influential observations.

However, linear models are sensitive to them, and therefore it is good practice to check for them.

"Blindly" removing extreme x-values is bad practice.

Observations that are not clear mistakes should NOT be removed from the analysis.

Methods that can deal with influential observations exist.
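A minimal sketch of such a check with standard influence measures, on simulated data with one constructed influential point (the data and cut-offs are illustrative):

```r
# Simulated data with one extreme x-value that also has a large error
set.seed(3)
x <- c(1:20, 35)                 # observation 21: extreme x-value
y <- 2 * x + rnorm(21, sd = 7)
y[21] <- y[21] + 40              # ...combined with a large error

fit <- lm(y ~ x)
cooks.distance(fit)   # values much larger than the rest deserve a look
hatvalues(fit)        # leverage; maximal at the extreme x-value (obs. 21)
which.max(cooks.distance(fit))
```

Note that a high-leverage point is not automatically harmful; it becomes influential when, as constructed here, it also carries a large error.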


Section 3

Errors and residuals


Errors vs residuals

[Figure: 'Errors' — y vs x with the true regression line]

[Figure: 'Residuals' — y vs x with the estimated regression line]

Errors and residuals

Errors = y_obs − y_true

Residuals = y_obs − ŷ

y_obs = β0 + β1 · x + ε

y_true = β0 + β1 · x, with β0 = 0, β1 = 2

ŷ = β̂0 + β̂1 · x, with β̂0 = −0.25, β̂1 = 2.15


Errors and residuals

Errors and residuals are not the same thing.

Most often only residuals are available.

Residuals are an "approximation" of the errors.

Model checking is based on residuals.

( Errors and residuals differ in some mathematical properties. )
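Because the introductory data are simulated, the true errors are known and can be compared with the residuals directly; a sketch:

```r
# Errors vs residuals on the simulated data from the introduction
set.seed(1)
N <- 20
x <- 1:N
errors <- rnorm(n = N, sd = 7)   # true errors (known only in simulation)
y <- 2 * x + errors

fit <- lm(y ~ x)
res <- resid(fit)                # residuals (what we have in practice)

cor(errors, res)   # strongly correlated, but not identical
sum(res)           # exactly 0 by construction (model contains an intercept)
sum(errors)        # not 0 in general
```

The zero-sum constraint is one of the "mathematical properties" in which residuals differ from errors: least squares forces it on the residuals, while the true errors satisfy it only in expectation.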


Section 4

Assessing the assumptions about the errors


Assessing the assumptions

We look at the "cats" data set and use males only.

[Figure: 'Heart weight vs Body weight (Males only)' — Hwt vs Bwt]

Hwt = β0 + βBwt · Bwt + ε

ε ~ N(0, σ²), i.i.d.

            Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)     -1.2        1.00     -1.2   2.4e-01
Bwt              4.3        0.34     12.7   3.6e-22
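A sketch of how this fit could be reproduced, assuming the slide uses the cats data from the MASS package (the dataset and subset are an assumption based on the slide title):

```r
# Heart weight vs body weight of adult cats, males only
library(MASS)

cats.M <- subset(cats, Sex == "M")         # males only
fit.cats <- lm(Hwt ~ Bwt, data = cats.M)   # Hwt = β0 + β_Bwt·Bwt + ε
summary(fit.cats)$coefficients             # compare with the table above
```

The estimated slope (≈ 4.3 g of heart weight per kg of body weight) and its very small p-value match the coefficient table shown on the slide.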


Assessing normality

ε ~ N(0, σ²), i.i.d.

[Figure: 'Histogram of the residuals (Cats LM)' — density histogram of the residuals]

Assessing normality

Quantile-quantile plots, QQ-plots for short, make it easier to spot deviations from normality.

[Figure: 'Normal Q-Q Plot' — sample quantiles vs theoretical quantiles]

How much deviation can be "tolerated"?
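In base R, a QQ-plot of the residuals is drawn with qqnorm() and qqline(); a minimal sketch on simulated data:

```r
# QQ-plot of the residuals of a fitted linear model
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 7)
fit <- lm(y ~ x)

qqnorm(resid(fit))   # points should lie close to a straight line
qqline(resid(fit))   # reference line through the quartiles
```

Systematic curvature suggests skewness, while S-shapes suggest heavier or lighter tails than the normal distribution.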


Assessing normality

Let's simulate normally distributed errors and redo the QQ-plots several times.

[Figure: nine 'Normal Q-Q Plot' panels of simulated normally distributed samples]

Is the QQ-plot for the observed residuals "different" from the simulated ones?
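The simulation idea can be sketched as follows (panel layout and sample size are illustrative):

```r
# Nine QQ-plots of fresh normal samples of the same size as the data:
# they show how much "wobble" is normal for perfectly normal errors.
set.seed(1)
n <- 20

op <- par(mfrow = c(3, 3))
for (i in 1:9) {
  z <- rnorm(n)
  qqnorm(z)   # each panel: a fresh normal sample
  qqline(z)
}
par(op)
```

If the QQ-plot of the observed residuals would not stand out among these nine panels, the normality assumption is not contradicted.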


Assessing normality

The {plgraphics} package is helpful to decide how much "deviation" can be tolerated.

[Figure: QQ-plot of the standardised residuals (st.sm.res_Hwt) against theoretical quantiles]

Assessing Error on zero

Are residuals "on zero"? ⇒ Plot them against the predictor.

[Figure: left, 'Heart weight vs Body weight (Males only)' (Hwt vs Bwt); right, residuals vs Bwt]

Assessing Error on zero

Example where this assumption is not fulfilled: A non-linear relationship

[Figure: 'Girth growth' — Girth [cm] vs MeasurementDate_AsDate (Apr-Sep), two panels]

Assessing Error on zero

Example where this assumption is not fulfilled: A non-linear relationship

[Figure: 'Girth growth' — Girth [cm] vs MeasurementDate_AsDate (top) and residuals vs MeasurementDate_AsDate (bottom)]

Assessing Homoscedasticity

Note: back on ”cats” data

Is the variance of the residuals constant?

Plot the absolute residuals against the predictor¹

[Figure: left, sqrt(abs(res.lm.cats.M)) vs Bwt; right, |st.sm.res_Hwt| vs fitted value ('Hwt ~ Bwt')]

¹ Often the absolute residuals are then also square-root-transformed.
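A sketch of this check in base R, again assuming the cats data from the MASS package (variable names are assumptions):

```r
# Homoscedasticity check: square-root-transformed absolute residuals
# against the predictor; look for a systematic trend.
library(MASS)

cats.M <- subset(cats, Sex == "M")
fit.cats <- lm(Hwt ~ Bwt, data = cats.M)

plot(cats.M$Bwt, sqrt(abs(resid(fit.cats))),
     xlab = "Bwt", ylab = "sqrt(|residuals|)")
# An increasing trend would indicate heteroscedastic errors.
```

The square-root transformation dampens the influence of a few large residuals, making a variance trend easier to see.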

Assessing Independence

All design variables MUST be contained in the model (e.g. person, block, ...)

( Time correlation can be tested with the autocorrelation functions (type ?acf or ?pacf in R) )

( Spatial correlation can be tested with variograms (type ?variog after loading the package {geoR} in R) )

By plotting the residuals against time or over space, it can be tested graphically whether the independence assumption is fulfilled.
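A minimal sketch of the graphical check, on simulated independent errors (the data are illustrative):

```r
# Plot residuals in observation (time) order to look for patterns
set.seed(4)
n <- 50
time <- 1:n
y <- 1 + 0.3 * time + rnorm(n)   # independent errors here

fit <- lm(y ~ time)
plot(time, resid(fit), type = "b")
abline(h = 0, lty = 2)
acf(resid(fit))   # spikes outside the bands would indicate serial correlation
```

With independent errors the residuals should scatter randomly around zero; runs of consecutive residuals with the same sign are a warning signal.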


Assessing influential observations

See the file "LM ResidualAnalyis Lab.pdf"


Section 5

Assessing a real case


Assessing a real case

See the file "LM ResidualAnalyis Lab.pdf" for a complete residual analysis (model with several predictors)


Section 6

Residual analysis: Why should I do that?


Why should I do that?

"Why should I perform model diagnostics?"

Mainly for two reasons:

To assess whether the model assumptions are fulfilled (i.e. whether the inference can be trusted)

To get more information out of your data


Why should I do that?

More information: The "Bees" example:

[Figure: abs(resid(fit1)) vs fitted(fit1)]

Why should I do that?

"Again these ... f***ing residuals. Why do we bother?!?"

[Figure: 'Observed' — abs. residuals vs fitted values]

Section 7

Summary


Summary

Residual analysis is essential to:

draw valid inference (p-values and confidence intervals)

better understand the data (insights)

build better models ⇒ draw better conclusions


Summary

4 main plots for residual analysis
- Residuals vs fitted plot: model equation, (constant variance, outliers, influential obs)
- Scale-location plot: constant variance, (outliers, influential obs)
- Residuals vs leverage plot: influential obs
- Normal QQ-plot: normality

Other diagnostic plots
- Residuals vs predictor plots: non-linearities, missing interactions (i.e. model equation), constant variance
- Residuals vs time: time correlation
- Residuals over space: space correlation
- ACF, PACF and variograms: time and space correlation
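For a standard lm fit, the four main plots are produced directly by plot() on the fitted object; a sketch on simulated data:

```r
# The four main diagnostic plots in one call
set.seed(1)
x <- 1:30
y <- 2 * x + rnorm(30, sd = 7)
fit <- lm(y ~ x)

op <- par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted, QQ-plot, scale-location, residuals vs leverage
par(op)
```

By default plot.lm shows exactly these four panels; further plots (e.g. Cook's distance) are available via the which argument.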

