Linear model diagnostics

Jan Mielniczuk

11 April 2017


Modelling dependence between quantitative variables

The volume of a tree depends on its height and thickness. On which of these quantities does it depend more? How can this be modelled? Consider the data set trees, which contains measurements of tree volume (variable name: Volume), girth (Girth, defined as the diameter at a height of 1.3 m) and height (Height). The form of the data set (its beginning and end) (R package, http://cran.r-project.org):

   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
......................
30  18.0     80   51.0
31  20.6     87   77.0

We look for a dependence (if one exists) of volume on girth and height.

We will try to explain the variability of volume by the variability of girth and/or height. The roles of the variables are not symmetric! Volume is the response variable (dependent variable); Girth is the independent variable (predictor, regressor).


Main tool for assessing the dependence: scatter plots of Volume against Girth and against Height.

Command: pairs(trees)

[Figure: scatter-plot matrix of Girth, Height and Volume]


We fit a multiple linear model to the trees data with the attributes Girth and Height as predictors.

> trees.lm = lm(Volume ~ ., data = trees)  # ~ . means: regress on all remaining variables
> summary(trees.lm)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-Squared: 0.948, Adjusted R-squared: 0.9442

This model explains about 95% of the variability of Y, but the model Volume ~ Girth alone already explains about 94%. Is Height needed?


F test for nested models

The F test can also be used for testing H0: ω is adequate against H1: Ω is adequate and ω is not, where ω ⊂ Ω are two nested models containing q and p attributes, respectively. If H0 is true, then

\[ F = \frac{(SSE_\omega - SSE_\Omega)/(p - q)}{SSE_\Omega/(n - p)} \sim F_{p-q,\,n-p}. \]


We fit the models Volume ~ Girth and Volume ~ Girth + Height and compare them using the F test (function anova(trees.lm, trees1.lm, test="F")):

Model 1: Volume ~ Girth + Height
Model 2: Volume ~ Girth
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     28 421.92
2     29 524.30 -1   -102.38 6.7943 0.01449 *

Height is significant in the model already containing Girth! (The p-value is the same as the p-value of the t test for Height in the model Volume ~ Girth + Height. When more than one variable is added, the two tests differ.)
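The F statistic above can be reproduced by hand from the two residual sums of squares. The following is a minimal sketch, assuming that p and q count all estimated coefficients including the intercept (so that n − p = 28 matches the residual degrees of freedom in the table); the object names small and large are chosen here for illustration only.

```r
# Nested-model F test on the built-in trees data, computed by hand.
small <- lm(Volume ~ Girth, data = trees)            # model omega
large <- lm(Volume ~ Girth + Height, data = trees)   # model Omega

sse_small <- sum(resid(small)^2)
sse_large <- sum(resid(large)^2)
n <- nrow(trees)
p <- length(coef(large))   # coefficients in Omega, intercept included (assumption)
q <- length(coef(small))   # coefficients in omega, intercept included (assumption)

f_stat  <- ((sse_small - sse_large) / (p - q)) / (sse_large / (n - p))
p_value <- pf(f_stat, p - q, n - p, lower.tail = FALSE)

# The same quantities from anova(), and the t statistic for Height
tab   <- anova(small, large)
t_val <- coef(summary(large))["Height", "t value"]

print(c(F = f_stat, p = p_value, t_squared = t_val^2))
```

Because only one coefficient is added, the square of the t statistic for Height coincides with F, which is why the two p-values agree.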


We fit a different model, treating a tree as a cone. Then

Volume ∝ Girth² × Height,

and thus we can fit a linear model to the logarithms of the original attributes.

summary(lm(log(Volume)~log(Girth)+log(Height),data=trees))

            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.63162    0.79979  -8.292 5.06e-09 ***
log(Girth)   1.98265    0.07501  26.432  < 2e-16 ***
log(Height)  1.11712    0.20444   5.464 7.81e-06 ***

Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-Squared: 0.9777, Adjusted R-squared: 0.9761


We cannot compare these two models using the F test for nested models, as they are not nested. However:

the value of the overall F statistic is much larger for the second model (613.2 vs 255.2);

the value of the t statistic for log(Height) is much larger than for Height.

This indirectly supports the model based on logarithms.
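The comparison can be checked directly in R. The sketch below refits both models on the built-in trees data and extracts the overall F statistics and the t statistics for the height terms. It also looks at 95% confidence intervals for the slopes of the log model, since the cone model Volume ∝ Girth² × Height would imply slopes of 2 and 1. The object names are illustrative.

```r
lin_fit <- lm(Volume ~ Girth + Height, data = trees)
log_fit <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)

# Overall F statistics of the two (non-nested) models
f_lin <- unname(summary(lin_fit)$fstatistic["value"])
f_log <- unname(summary(log_fit)$fstatistic["value"])

# t statistics for the height term in each model
t_height     <- coef(summary(lin_fit))["Height", "t value"]
t_log_height <- coef(summary(log_fit))["log(Height)", "t value"]

# 95% confidence intervals for the slopes of the log model;
# the cone model suggests 2 for log(Girth) and 1 for log(Height)
ci <- confint(log_fit)[c("log(Girth)", "log(Height)"), ]

print(c(F_linear = f_lin, F_log = f_log,
        t_Height = t_height, t_logHeight = t_log_height))
print(ci)
```

Both intervals cover the values suggested by the cone model, which is consistent with the slide's conclusion, although the two responses are on different scales and the comparison remains informal.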

Example. The fitness data concern measurements taken on a 1.5 mile run.
Oxygen - oxygen intake (in ml per kg of body weight per minute)
Age - age (years)
Weight - weight (kg)
RunTime - time to run 1.5 miles (min)
RestPulse - resting pulse
RunPulse - mean pulse while running
MaxPulse - maximal pulse while running
The dependent variable is Oxygen.


Call: lm(formula = Oxygen ~ ., data = fitness)

Residuals:
     Min       1Q   Median       3Q      Max
-5.40256 -0.89908  0.07063  1.04964  5.38469

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.93448   12.40326   8.299 1.64e-08 ***
Age          -0.22697    0.09984  -2.273  0.03224 *
Weight       -0.07418    0.05459  -1.359  0.18687
RunTime      -2.62865    0.38456  -6.835 4.54e-07 ***
RestPulse    -0.02153    0.06605  -0.326  0.74725
RunPulse     -0.36963    0.11985  -3.084  0.00508 **
MaxPulse      0.30322    0.13650   2.221  0.03601 *

Residual standard error: 2.317 on 24 degrees of freedom
Multiple R-Squared: 0.8487, Adjusted R-squared: 0.8108
F-statistic: 22.43 on 6 and 24 DF, p-value: 9.715e-09


Assuming the model fits, we reject the hypothesis that all variables are insignificant (the p-value of the F test is 9.7 × 10^−9);

the model explains about 85% of the variability of oxygen intake;

the smallest p-values of the t statistics correspond to the variables RunTime and RunPulse;

neither Weight nor RestPulse contributes significantly to explaining the variability of Y when the remaining variables are included in the model.


General linear hypothesis

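Example. 26 subjects were tested to study the effect of exercise and weight (attribute x) on HDL cholesterol level.

First group: control (8 subjects)

Second group: running programme (8 subjects)

Third group: running and weight-lifting programme (10 subjects)

\[ y_i = \beta_0 + \beta_1 x_i, \quad i = 1, \dots, 8 \]

\[ y_i = \gamma_0 + \gamma_1 x_i, \quad i = 9, \dots, 16 \]

\[ y_i = \delta_0 + \delta_1 x_i, \quad i = 17, \dots, 26 \]

We want to test the hypothesis H0: β1 = γ1 = δ1 (equal slopes in all three groups). This is not one of the cases considered before. How can it be done? The hypothesis can be written as a set of linear constraints on the parameter vector:

\[ C = \begin{pmatrix} 0 & 0 & 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 1 & 0 & -1 \end{pmatrix}, \qquad \beta^T = (\beta_0, \gamma_0, \delta_0, \beta_1, \gamma_1, \delta_1), \]

\[ H_0: \beta_1 - \gamma_1 = 0 \ \&\ \beta_1 - \delta_1 = 0 \ \equiv\ C\beta = 0. \]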
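One way to test Cβ = 0 is sketched below. The HDL measurements are not given in the slides, so the data here are simulated and purely hypothetical. The statistic used is the standard F statistic for a general linear hypothesis, (Cb)'[C(X'X)⁻¹C']⁻¹(Cb) / (r S²) with r the number of rows of C, which the slides do not state explicitly. It is compared with the F statistic from the equivalent nested-model comparison (common slope vs separate slopes).

```r
set.seed(1)  # hypothetical data: the HDL measurements are not in the slides
group <- factor(rep(c("control", "run", "run_lift"), times = c(8, 8, 10)))
x     <- runif(26, 60, 100)            # made-up weights
y     <- 50 + 0.1 * x + rnorm(26)      # made-up HDL levels with a common slope

# Design matrix with columns ordered as (beta0, gamma0, delta0, beta1, gamma1, delta1)
ind <- model.matrix(~ group - 1)       # group indicators
X   <- cbind(ind, ind * x)             # separate intercepts, separate slopes
fit <- lm(y ~ X - 1)
b   <- coef(fit)

C <- rbind(c(0, 0, 0, 1, -1,  0),
           c(0, 0, 0, 1,  0, -1))

# F statistic for the general linear hypothesis C beta = 0
Cb     <- C %*% b
s2     <- sum(resid(fit)^2) / fit$df.residual
XtXi   <- solve(crossprod(X))
f_wald <- drop(t(Cb) %*% solve(C %*% XtXi %*% t(C)) %*% Cb) / (nrow(C) * s2)

# Equivalent nested-model comparison: common slope vs separate slopes
reduced  <- lm(y ~ group + x)
full     <- lm(y ~ group * x)
f_nested <- anova(reduced, full)$F[2]

print(c(F_general = f_wald, F_nested = f_nested))
```

The two statistics agree because imposing Cβ = 0 on the full model yields exactly the common-slope model, so the general linear hypothesis reduces to the nested-model F test in this case.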


Problem of prediction

We want to predict the mean response $EY(x_0)$ for a new value $x_0$ which differs from the observed values $x_i$. Obviously

\[ EY(x_0) = \beta' x_0. \]

The predicted value is $\hat{Y}(x_0) = b' x_0$. It is unbiased:

\[ E\hat{Y}(x_0) = \beta' x_0. \]

Moreover,

\[ \sigma^2_{\hat{Y}(x_0)} = x_0' \Sigma_b x_0 = \sigma^2 x_0' (X'X)^{-1} x_0. \]

The standard error is defined as

\[ SE_{\hat{Y}(x_0)} = S \sqrt{x_0' (X'X)^{-1} x_0}. \]

The confidence interval for $EY(x_0)$ at confidence level $1 - \alpha$ is

\[ \hat{Y}(x_0) \pm t_{1-\alpha/2,\,n-p}\, SE_{\hat{Y}(x_0)}. \]
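The interval can be computed directly from these formulas and checked against predict(). The sketch below uses the model Volume ~ Girth on the built-in trees data; the new point Girth = 15 is an arbitrary value chosen for illustration.

```r
fit <- lm(Volume ~ Girth, data = trees)

x0 <- c(1, 15)                 # intercept term and a hypothetical Girth of 15
X  <- model.matrix(fit)
s  <- summary(fit)$sigma       # S, the residual standard error

y0 <- sum(coef(fit) * x0)                                    # b' x0
se <- s * sqrt(drop(t(x0) %*% solve(crossprod(X)) %*% x0))   # S sqrt(x0'(X'X)^-1 x0)

alpha <- 0.05
tq    <- qt(1 - alpha / 2, df = fit$df.residual)
ci_manual <- c(fit = y0, lwr = y0 - tq * se, upr = y0 + tq * se)

# The same interval from predict(), plus the wider interval for a new observation
new        <- data.frame(Girth = 15)
ci_predict <- predict(fit, new, interval = "confidence", level = 1 - alpha)
pi_predict <- predict(fit, new, interval = "prediction", level = 1 - alpha)

print(rbind(manual = ci_manual, predict = ci_predict[1, ], new_obs = pi_predict[1, ]))
```

The interval for a new observation is wider because it also accounts for the variance of the individual error term, not only the uncertainty in the estimated mean.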


Confidence intervals for mean Volume based on Girth. The wider intervals are intervals for a new observation.

[Figure: Volume against Girth with the fitted line, confidence intervals for the mean and the wider intervals for a new observation]


Confidence intervals for mean Volume based on Height.

[Figure: Volume against Height with the fitted line and confidence intervals]


Diagnostics

Example trees, continued. Consider the plot of studentized residuals against Girth in the model Volume ~ Height + Girth.

[Figure: studentized residuals (trees.res) against Girth]

The pattern suggests adding the square of Girth to the model.
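A plot of this kind can be produced as follows. This is a sketch that assumes externally studentized residuals as returned by rstudent(); the slides do not say which variant was used. The small auxiliary regression at the end is an illustrative check for curvature and is not part of the original analysis.

```r
trees.lm  <- lm(Volume ~ Girth + Height, data = trees)
trees.res <- rstudent(trees.lm)   # externally studentized residuals (assumed variant)

plot(trees$Girth, trees.res, xlab = "Girth", ylab = "trees.res")
abline(h = 0, lty = 2)

# Illustrative curvature check: regress the residuals on centred Girth and its square.
# A clearly positive quadratic coefficient corresponds to a U-shaped residual pattern.
g    <- trees$Girth - mean(trees$Girth)
curv <- unname(coef(lm(trees.res ~ g + I(g^2)))["I(g^2)"])
print(curv)
```

A U-shaped pattern in the residuals indicates that the linear term in Girth does not capture all of the dependence, which motivates the quadratic term.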


> trees1.fit = lm(Volume ~ Girth + I(Girth * Girth) + Height, data = trees)
> summary(trees1.fit)

lm(formula = Volume ~ Girth + I(Girth * Girth) + Height, data = trees)

                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      -9.92041   10.07911  -0.984 0.333729
Girth            -2.88508    1.30985  -2.203 0.036343 *
I(Girth * Girth)  0.26862    0.04590   5.852 3.13e-06 ***
Height            0.37639    0.08823   4.266 0.000218 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Multiple R-squared: 0.9771, Adjusted R-squared: 0.9745
F-statistic: 383.2 on 3 and 27 DF, p-value: < 2.2e-16


> anova(trees.lm, trees1.fit)
Analysis of Variance Table

Model 1: Volume ~ Girth + Height
Model 2: Volume ~ Girth + I(Girth * Girth) + Height
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     28 421.92
2     27 186.01  1    235.91 34.243 3.13e-06 ***

(The p-value is the same as for the t statistic of the Girth² term.) We now have two models of comparable quality:

Volume ~ Girth + I(Girth * Girth) + Height

log(Volume) ~ log(Girth) + log(Height).

The second has only two predictors.


Example trees cont’d

Studentized residuals vs Girth for the larger model.

[Figure: studentized residuals (trees.res) against Girth for the model with the quadratic term]


Outliers and influential observations

[Figure: scatter plot with two least-squares lines (I and II) and two marked observations, 1 and 2; x axis from 1400 to 1900, y axis from 500 to 2500]

Observation 2 is influential and outlying; observation 1 is neither influential nor outlying. If we move observation 2 to the left (changing its x value to, e.g., 1650), it is still outlying but no longer influential.
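The distinction can be illustrated numerically with leverage, studentized residuals and Cook's distance. The data behind the figure are not available, so the sketch below uses a small made-up data set: ten points close to the line y = 2x plus one extra point that lies far from that line. The extra point is placed first at an extreme x value and then near the centre of the x range; the function name influence_of_last is illustrative.

```r
# Made-up data: ten points close to the line y = 2x (not the data from the figure)
x <- 1:10
y <- 2 * x + c(0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1)

# Fit the line with one extra point appended and report its diagnostics
influence_of_last <- function(x_new, y_new) {
  d   <- data.frame(x = c(x, x_new), y = c(y, y_new))
  fit <- lm(y ~ x, data = d)
  i   <- nrow(d)
  c(leverage = unname(hatvalues(fit)[i]),
    rstudent = unname(rstudent(fit)[i]),
    cooks_d  = unname(cooks.distance(fit)[i]))
}

far    <- influence_of_last(20, 10)    # off the line, extreme x: outlying and influential
centre <- influence_of_last(5.5, 30)   # off the line, central x: outlying, low leverage

print(rbind(far, centre))
```

In both positions the extra point has a large studentized residual, but its leverage and Cook's distance drop sharply once it is moved to the centre of the x range, which mirrors the remark about moving observation 2.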


[Figure: Huber's data: L.S. line and influential obs.; yh against xh with three fitted lines]

Line: y = 0.06833 − 0.08146*x (all observations), p-value = 0.581

Line: y = −1.8720 − 0.9770*x (without observation 6), p-value = 0.0114

Line: y = −0.42 + 0.008899*x (without observation 1), p-value = 0.936


Outliers and influential observations

[Figure, left panel: studres(hills.lm) against hills.lm$fitted, with Bens of Jura and Knock Hill labelled]

[Figure, right panel: cooks.distance(hills.lm) against Index, with Lairig Ghru, Knock Hill and Bens of Jura labelled]


Transformation of the response

Data GALAPAGOS. Observations on species of turtles living on the Galapagos islands and characteristics of these islands.
Species - number of species on an island
Endemics - number of endemic species
Area - island's area (km²)
Elevation - elevation of the highest point (m)
Nearest - distance to the nearest island (km)
Scruz - distance to the island Santa Cruz (km)
Adjacent - area of the nearest island (km²)

[Figure, left panel: Residuals of the model Species ~ .]

[Figure, right panel: Residuals of the model sqrt(Species) ~ .]


Example - data flow.

TEMP ~ PRESS + DENS + CODE + WGT + AWGT + TIME + PULSES + ULLAGE

              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.225e+02  1.048e+01  21.221  < 2e-16
PRESS        2.253e-02  5.396e-03   4.175 5.88e-05
DENS        -2.143e+01  1.578e+00 -13.578  < 2e-16
CODE        -1.175e-01  4.024e-02  -2.920  0.00422
WGT          9.802e-03  6.647e-03   1.475  0.14310
AWGT        -1.690e-02  2.187e-02  -0.773  0.44108
TIME         6.897e-05  3.424e-04   0.201  0.84073
PULSES       5.047e-02  1.450e-01   0.348  0.72849
ULLAGE       1.655e-04  5.882e-04   0.281  0.77893

Multiple R-Squared: 0.9991, Adjusted R-squared: 0.999
F-statistic: 1.586e+04 on 8 and 113 DF, p-value: < 2.2e-16

VIF indices for the full model

   PRESS     DENS    CODE      WGT    AWGT   TIME    PULSES ULLAGE
 79.2596 495.1694 124.912 497.9868 5622.73 1.2134 5826.8656 1.2043


We remove PULSES and AWGT, which have the largest VIF values:

   PRESS     DENS    CODE    WGT   TIME ULLAGE
 72.5913 177.0520 113.518 1.1909 1.1977 1.1327

and then also the DENS variable:

    PRESS   CODE    WGT   TIME ULLAGE
 1.101051 1.2698 1.1766 1.1909 1.0786

The VIF values for PRESS and CODE are still large in the second model; after removing DENS they become small.
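For reference, the VIF of predictor j is 1 / (1 − R_j²), where R_j² is the coefficient of determination from regressing predictor j on the remaining predictors. The flow data are not available here, so the sketch below uses the two predictors of the built-in trees data as a stand-in and computes the indices in two equivalent ways; the object names are illustrative.

```r
# Stand-in predictors: the flow data from the slides are not available here
X <- trees[, c("Girth", "Height")]

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_by_regression <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent computation: diagonal of the inverse correlation matrix of the predictors
vif_by_inverse <- diag(solve(cor(X)))

print(rbind(vif_by_regression, vif_by_inverse))
```

With only two predictors both indices equal 1 / (1 − r²), where r is their correlation. Values in the hundreds or thousands, as in the flow example, indicate that a predictor is almost a linear combination of the others.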
