Diagnostics of the linear model
Jan Mielniczuk
11 April 2017
Modeling dependence between quantitative variables
The volume of a tree depends on its height and thickness. On which of these quantities does it depend more strongly? How can this be modeled? Consider the data set trees, which contains measurements of tree volume (variable name: Volume), girth (Girth, defined as the diameter at a height of 1.3 m) and height (Height). The first and last rows of the data set (R package, http://cran.r-project.org):
   Girth Height Volume
1    8.3     70   10.3
2    8.6     65   10.3
3    8.8     63   10.2
......................
30  18.0     80   51.0
31  20.6     87   77.0
We look for a dependence (if there is one) of volume on girth and height.
We will try to explain the variability of volume by the variability of girth and/or height. The roles of the variables are not symmetric!
Volume: response variable (dependent variable)
Girth: explanatory variable (predictor, regressor, independent variable)
The main tool for assessing dependence: scatterplots of Volume against Girth and against Height.
Command: pairs(trees)
[Figure: scatterplot matrix of Girth, Height and Volume.]
We fit a multiple linear model to the trees data with Girth and Height as predictors.
> trees.lm=lm(Volume ~.,data=trees) # ~. means w.r.t. all predictors
> summary(trees.lm)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
Girth         4.7082     0.2643  17.816  < 2e-16 ***
Height        0.3393     0.1302   2.607   0.0145 *

Residual standard error: 3.882 on 28 degrees of freedom
Multiple R-Squared: 0.948, Adjusted R-squared: 0.9442
The model explains 95% of the variability of Y, while the model Volume ∼ Girth alone already explains 94%. Is Height needed?
F test for nested models
The F test can also be used to test H0: ω is adequate against H1: Ω is adequate and ω is not, where ω ⊂ Ω are two nested models containing q and p attributes, respectively. If H0 is true, then
$$F = \frac{(SSE_\omega - SSE_\Omega)/(p - q)}{SSE_\Omega/(n - p)} \sim F_{p-q,\,n-p}.$$
We fit the models Volume ∼ Girth and Volume ∼ Girth + Height and compare them using the F test (function anova(trees.lm,trees1.lm,test="F")):
Model 1: Volume ~ Girth + Height
Model 2: Volume ~ Girth
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     28 421.92
2     29 524.30 -1   -102.38 6.7943 0.01449 *
Height is significant in the model that already contains Girth! (The p-value equals the p-value of the t test for Height in the model Volume ∼ Girth + Height. When more than one variable is added at once, the F test no longer coincides with a single t test.)
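The F statistic for nested models can be reproduced by hand from the two residual sums of squares. The sketch below assumes that p and q count all fitted coefficients, intercept included, so that n − p equals the residual degrees of freedom (28 here); the names fit_small and fit_large are chosen for illustration only.

```r
# Nested models for the built-in trees data
fit_small <- lm(Volume ~ Girth, data = trees)            # omega
fit_large <- lm(Volume ~ Girth + Height, data = trees)   # Omega

sse_small <- sum(resid(fit_small)^2)
sse_large <- sum(resid(fit_large)^2)
n <- nrow(trees)
p <- length(coef(fit_large))   # assumption: coefficients counted with the intercept
q <- length(coef(fit_small))

# F statistic from the formula and its p-value
f_stat <- ((sse_small - sse_large) / (p - q)) / (sse_large / (n - p))
p_val <- pf(f_stat, p - q, n - p, lower.tail = FALSE)

# Squared t statistic for Height in the larger model
t_height <- summary(fit_large)$coefficients["Height", "t value"]

print(c(F = f_stat, t_squared = t_height^2, p_value = p_val))
```

When a single variable is added, the F statistic is the square of the corresponding t statistic, which is why the two p-values agree.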
We now fit a different model, treating a tree as a cone. Then
$$\text{Volume} \propto \text{Girth}^2 \times \text{Height},$$
and thus we can fit a linear model to the logarithms of the original attributes.
summary(lm(log(Volume)~log(Girth)+log(Height),data=trees))
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.63162    0.79979  -8.292 5.06e-09 ***
log(Girth)   1.98265    0.07501  26.432  < 2e-16 ***
log(Height)  1.11712    0.20444   5.464 7.81e-06 ***

Residual standard error: 0.08139 on 28 degrees of freedom
Multiple R-Squared: 0.9777, Adjusted R-squared: 0.9761
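The cone argument can be spelled out as follows. Here V stands for Volume, d for Girth (a diameter) and h for Height. The constant c, which absorbs π/12 and any unit conversions, is a symbol introduced for this sketch and does not appear on the slide.

$$V = \tfrac{1}{3}\,\pi \left(\tfrac{d}{2}\right)^2 h = \tfrac{\pi}{12}\, d^2 h \quad\Longrightarrow\quad V = c\, d^2 h \quad\Longrightarrow\quad \log V = \log c + 2 \log d + \log h.$$

Under this idealization the coefficients of log(Girth) and log(Height) would be 2 and 1. The fitted values 1.98 and 1.12 are close to these.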
We cannot compare these two models using the F test, as they are not nested. However:
the value of the overall F statistic is much larger for the second model (613.2 vs 255.2);
the value of the t statistic for log Height is much larger than for Height.
This indirectly supports the model based on logarithms.
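The two informal comparisons can be read off the fitted model summaries. This is a minimal sketch on the built-in trees data; the object names fit_lin and fit_log are illustrative.

```r
fit_lin <- lm(Volume ~ Girth + Height, data = trees)
fit_log <- lm(log(Volume) ~ log(Girth) + log(Height), data = trees)

# Overall F statistic (element "value" of fstatistic) for each model
f_lin <- unname(summary(fit_lin)$fstatistic["value"])
f_log <- unname(summary(fit_log)$fstatistic["value"])

# t statistics for Height and log(Height)
t_height <- summary(fit_lin)$coefficients["Height", "t value"]
t_log_height <- summary(fit_log)$coefficients["log(Height)", "t value"]

print(c(F_linear = f_lin, F_log = f_log))
print(c(t_Height = t_height, t_logHeight = t_log_height))
```

Because the responses differ (Volume versus log(Volume)), these statistics are only a heuristic comparison, not a formal test.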
Example
The fitness data concern measurements related to a 1.5-mile run.
Oxygen - oxygen intake (in ml per kg of body weight per minute)
Age - age (years)
Weight - weight (kg)
RunTime - time to run 1.5 miles (min)
RestPulse - resting pulse
RunPulse - mean pulse during the run
MaxPulse - maximal pulse during the run
The dependent variable is Oxygen.
Call: lm(formula = Oxygen ~ ., data = fitness)

Residuals:
     Min       1Q   Median       3Q      Max
-5.40256 -0.89908  0.07063  1.04964  5.38469

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 102.93448   12.40326   8.299 1.64e-08 ***
Age          -0.22697    0.09984  -2.273  0.03224 *
Weight       -0.07418    0.05459  -1.359  0.18687
RunTime      -2.62865    0.38456  -6.835 4.54e-07 ***
RestPulse    -0.02153    0.06605  -0.326  0.74725
RunPulse     -0.36963    0.11985  -3.084  0.00508 **
MaxPulse      0.30322    0.13650   2.221  0.03601 *

Residual standard error: 2.317 on 24 degrees of freedom
Multiple R-Squared: 0.8487, Adjusted R-squared: 0.8108
F-statistic: 22.43 on 6 and 24 DF, p-value: 9.715e-09
Assuming the model fits, we reject the hypothesis that all variables are insignificant (the p-value of the F test is 9.7 × 10⁻⁹); the model explains 85% of the variability of oxygen intake;
The smallest p-values of the t statistics correspond to the variables RunTime and RunPulse;
Neither Weight nor RestPulse contributes significantly to explaining the variability of Y when the remaining variables are included in the model.
General linear hypothesis
Example: 26 subjects were studied to assess the effect of exercise activity and weight (attribute x) on HDL cholesterol level.
First group - control (8 subjects)
Second group - running programme (8 subjects)
Third group - running & weight lifting programme (10 subjects)
$$y_i = \beta_0 + \beta_1 x_i, \quad i = 1, \ldots, 8$$
$$y_i = \gamma_0 + \gamma_1 x_i, \quad i = 9, \ldots, 16$$
$$y_i = \delta_0 + \delta_1 x_i, \quad i = 17, \ldots, 26$$
We want to test the hypothesis H0: β1 = γ1 = δ1. This is not a case considered before. How can it be done?
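$$C = \begin{pmatrix} 0 & 0 & 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 1 & 0 & -1 \end{pmatrix}, \qquad \beta^T = \begin{pmatrix} \beta_0 & \gamma_0 & \delta_0 & \beta_1 & \gamma_1 & \delta_1 \end{pmatrix}$$
$$H_0 : \beta_1 - \gamma_1 = 0 \ \&\ \beta_1 - \delta_1 = 0 \;\equiv\; C\beta = 0$$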
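A hypothesis of the form Cβ = 0 can be tested with the standard F statistic (Cb)′[C(X′X)⁻¹C′]⁻¹(Cb) / (r S²), where r is the number of rows of C. The slide does not state this formula, so it is brought in here as a known result. The sketch below uses simulated data, not the HDL study data, and all variable names and numeric values are hypothetical. It checks that the matrix form gives the same F as comparing the common-slope model with the separate-slopes model.

```r
set.seed(1)   # simulated data, purely illustrative
group <- factor(rep(c("control", "run", "run_lift"), times = c(8, 8, 10)))
x <- runif(26, 60, 100)           # hypothetical weights
y <- 40 + 0.1 * x + rnorm(26)     # hypothetical responses with equal slopes

# Design matrix with columns ordered as (beta0, gamma0, delta0, beta1, gamma1, delta1)
G <- model.matrix(~ group - 1)
X <- cbind(G, G * x)
colnames(X) <- c("beta0", "gamma0", "delta0", "beta1", "gamma1", "delta1")
fit <- lm(y ~ X - 1)
b <- coef(fit)

C <- rbind(c(0, 0, 0, 1, -1, 0),
           c(0, 0, 0, 1, 0, -1))
r <- nrow(C)
s2 <- sum(resid(fit)^2) / df.residual(fit)
Cb <- C %*% b
F_glh <- as.numeric(t(Cb) %*% solve(C %*% solve(crossprod(X)) %*% t(C)) %*% Cb) / (r * s2)
p_glh <- pf(F_glh, r, df.residual(fit), lower.tail = FALSE)

# Same test via nested models: common slope vs group-specific slopes
fit0 <- lm(y ~ group + x)
fit1 <- lm(y ~ group * x)
F_anova <- anova(fit0, fit1)$F[2]

print(c(F_glh = F_glh, F_anova = F_anova, p_value = p_glh))
```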
Problem of prediction
We want to predict the mean response $EY(x_0)$ for a new value $x_0$ that differs from the observed values $x_i$. Obviously
$$EY(x_0) = \beta' x_0.$$
The predicted value is $\hat{Y}(x_0) = b' x_0$. It follows that
$$E\hat{Y}(x_0) = \beta' x_0.$$
Moreover,
$$\sigma^2_{\hat{Y}(x_0)} = x_0' \Sigma_b x_0 = \sigma^2 x_0' (X'X)^{-1} x_0.$$
The standard error is defined as
$$SE_{\hat{Y}(x_0)} = S \sqrt{x_0' (X'X)^{-1} x_0}.$$
The confidence interval for $EY(x_0)$ at confidence level $1 - \alpha$ is
$$\hat{Y}(x_0) \pm t_{1-\alpha/2,\,n-p}\, SE_{\hat{Y}(x_0)}.$$
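The interval formula can be checked against R's predict(). The sketch below uses the trees model with a new point x0 chosen arbitrarily for illustration (Girth = 15, Height = 75), and takes p to be the number of columns of the design matrix, intercept included.

```r
fit <- lm(Volume ~ Girth + Height, data = trees)
x0_df <- data.frame(Girth = 15, Height = 75)   # hypothetical new point
x0 <- c(1, 15, 75)                             # same point with the intercept term

X <- model.matrix(fit)
b <- coef(fit)
S <- summary(fit)$sigma
n <- nrow(X)
p <- ncol(X)

# Point prediction, standard error and confidence interval from the formulas
y_hat <- sum(b * x0)
se <- S * sqrt(drop(t(x0) %*% solve(crossprod(X)) %*% x0))
alpha <- 0.05
ci_manual <- y_hat + c(-1, 1) * qt(1 - alpha / 2, n - p) * se

# The same interval from predict(), plus the interval for a new observation
ci_predict <- predict(fit, newdata = x0_df, interval = "confidence", level = 1 - alpha)
pi_predict <- predict(fit, newdata = x0_df, interval = "prediction", level = 1 - alpha)

print(ci_manual)
print(ci_predict)
print(pi_predict)
```

The interval for a new observation is wider because it also accounts for the variance of the individual error term.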
Confidence intervals for mean Volume based on Girth. The wide intervals are the intervals for a new observation.
[Figure: Volume plotted against Girth with both sets of intervals.]
Confidence intervals for mean Volume based on Height.
[Figure: Volume plotted against Height with the confidence intervals.]
Diagnostics
Example trees, cont'd. Consider the plot of studentized residuals in the model Volume ∼ Height + Girth.
[Figure: studentized residuals (trees.res) plotted against Girth.]
The plot suggests adding the square of Girth to the model.
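A residual plot of this kind can be produced as sketched below. The slide does not show how trees.res was computed; using base R's rstudent() is an assumption, and the lowess curve is added only to make any curvature easier to see.

```r
fit <- lm(Volume ~ Height + Girth, data = trees)
trees_res <- rstudent(fit)   # studentized residuals (base R)

plot(trees$Girth, trees_res, xlab = "Girth", ylab = "Studentized residual")
abline(h = 0, lty = 2)                    # reference line at zero
lines(lowess(trees$Girth, trees_res))     # smooth trend through the residuals
```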
> trees1.fit=lm(Volume~Girth +I(Girth*Girth) +Height) # assumes attach(trees) was called
> summary(trees1.fit)
lm(formula = Volume ~ Girth + I(Girth * Girth) + Height)
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)      -9.92041   10.07911  -0.984 0.333729
Girth            -2.88508    1.30985  -2.203 0.036343 *
I(Girth * Girth)  0.26862    0.04590   5.852 3.13e-06 ***
Height            0.37639    0.08823   4.266 0.000218 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Multiple R-squared: 0.9771, Adjusted R-squared: 0.9745
F-statistic: 383.2 on 3 and 27 DF, p-value: < 2.2e-16
> anova(trees.fit,trees1.fit)
Analysis of Variance Table

Model 1: Volume ~ Girth + Height
Model 2: Volume ~ Girth + I(Girth * Girth) + Height
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     28 421.92
2     27 186.01  1    235.91 34.243 3.13e-06 ***
(The p-value is the same as for the t statistic pertaining to Girth².) We have two models of comparable quality:
Volume ∼ Girth + I(Girth * Girth) + Height
log(Volume) ∼ log(Girth) + log(Height).
The second has only two predictors.
Example trees, cont'd
Studentized residuals vs Girth for the larger model.
[Figure: studentized residuals (trees.res) plotted against Girth.]
Outliers and influential observations
[Figure: scatterplot with two marked observations, 1 and 2; the labels I and II also appear in the plot.]
Observation 2 is influential and outlying; observation 1 is neither influential nor outlying. If we move observation 2 to the left (changing its x value to, e.g., 1650), it is still outlying but no longer influential.
Huber's data: L.S. line and influential obs.
[Figure: scatterplot of yh against xh with three fitted least-squares lines.]
Line: y = 0.06833 − 0.08146*x (all observations), p-value = 0.581
Line: y = −1.8720 − 0.9770*x (without observation 6), p-value = 0.0114
Line: y = −0.42 + 0.008899*x (without observation 1), p-value = 0.936
Outliers and influential observations
[Figure, first panel: studres(hills.lm) plotted against hills.lm$fitted; labelled points: Bens of Jura, Knock Hill.]
[Figure, second panel: cooks.distance(hills.lm) plotted against Index; labelled points: Lairig Ghru, Knock Hill, Bens of Jura.]
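Diagnostics of this kind can be sketched with the hills data from the MASS package. The slides do not show how hills.lm was fitted; the model time ~ dist + climb used below is an assumption.

```r
library(MASS)   # provides the hills data and studres()

# Assumption: the fitted model is not shown on the slide
hills.lm <- lm(time ~ dist + climb, data = hills)

res <- studres(hills.lm)          # studentized residuals
cd <- cooks.distance(hills.lm)    # Cook's distances

op <- par(mfrow = c(1, 2))
plot(fitted(hills.lm), res, xlab = "hills.lm$fitted", ylab = "studres(hills.lm)")
plot(cd, type = "h", xlab = "Index", ylab = "cooks.distance(hills.lm)")
par(op)

# Races with the largest value of each diagnostic
print(head(sort(res, decreasing = TRUE), 3))
print(head(sort(cd, decreasing = TRUE), 3))
```

A large studentized residual flags an outlier, while a large Cook's distance flags an observation whose removal would change the fit substantially. The two lists need not coincide.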
Transformation of the response
Data GALAPAGOS. Observations on species of turtles living on the Galapagos islands and characteristics of these islands.
Species - number of species on an island
Endemics - number of endemic species
Area - island's area (in km²)
Elevation - elevation of the highest point (m)
Nearest - distance to the nearest island (km)
Scruz - distance to the island Santa Cruz (km)
Adjacent - area of the nearest island (km²)
[Figure, first panel: residuals of the model Species ~ .]
[Figure, second panel: residuals of the model sqrt(Species) ~ .]
Example - data flow.
TEMP ~ PRESS + DENS + CODE + WGT + AWGT + TIME + PULSES + ULLAGE
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.225e+02  1.048e+01  21.221  < 2e-16
PRESS       2.253e-02  5.396e-03   4.175 5.88e-05
DENS       -2.143e+01  1.578e+00 -13.578  < 2e-16
CODE       -1.175e-01  4.024e-02  -2.920  0.00422
WGT         9.802e-03  6.647e-03   1.475  0.14310
AWGT       -1.690e-02  2.187e-02  -0.773  0.44108
TIME        6.897e-05  3.424e-04   0.201  0.84073
PULSES      5.047e-02  1.450e-01   0.348  0.72849
ULLAGE      1.655e-04  5.882e-04   0.281  0.77893

Multiple R-Squared: 0.9991, Adjusted R-squared: 0.999
F-statistic: 1.586e+04 on 8 and 113 DF, p-value: < 2.2e-16
VIF indices for the full model
  PRESS     DENS    CODE      WGT    AWGT   TIME    PULSES ULLAGE
79.2596 495.1694 124.912 497.9868 5622.73 1.2134 5826.8656 1.2043
Remove PULSES and AWGT, which have the largest VIF values:
  PRESS     DENS    CODE    WGT   TIME ULLAGE
72.5913 177.0520 113.518 1.1909 1.1977 1.1327
and then the DENS variable:
   PRESS   CODE    WGT   TIME ULLAGE
1.101051 1.2698 1.1766 1.1909 1.0786
The VIF values for PRESS and CODE are large in the second model; after removing DENS they become small.
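The slides do not show how the VIF indices were computed. A minimal sketch, assuming the usual definition VIF_j = 1/(1 − R_j²), where R_j² is obtained by regressing predictor j on the remaining predictors, is given below. The helper vif_manual is hypothetical, and the trees data stand in for the flow data, which are not available here.

```r
# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Illustration on the built-in trees data
preds <- c("Girth", "Height")
vif_trees <- vif_manual(trees, preds)

# Equivalent computation: diagonal of the inverse correlation matrix of the predictors
vif_cor <- diag(solve(cor(trees[, preds])))

print(vif_trees)
print(vif_cor)
```

Removing variables one group at a time and recomputing, as done above for PULSES, AWGT and DENS, amounts to calling such a function again with a shorter predictor list.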