
Part 3. Practical Tutorial: Practical Regression and ANOVA using R


R Language and Statistical Analysis – East China Normal University (September 2008) http://my.mofile.com/tangyc8866

Practical Regression and ANOVA using R

Julian J. Faraway

July 2002


Chapter 1 Introduction

1.1 Look before you leap!

Statistics starts with a problem, continues with the collection of data, proceeds with the data analysis and finishes with conclusions. It is a common mistake of inexperienced statisticians to plunge into a complex analysis without paying attention to what the objectives are or even whether the data are appropriate for the proposed analysis.


Formulation of a problem is often more essential than its solution. You must:

1. Understand the physical background.
2. Understand the objective.
3. Make sure you know what the client wants.
4. Put the problem into statistical terms.

That a statistical method can read in and process the data is not enough. The results may be totally meaningless.


Data Collection
– It is important to understand how the data were collected.

Initial Data Analysis -- a critical step to be performed
• Numerical summaries
• Graphical summaries
  – One variable: boxplots, histograms etc.
  – Two variables: scatterplots.
  – Many variables: interactive graphics.

Are the data distributed as you expected?


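Example – pima (Page 9)
Variables:
pregnant -- number of times pregnant
glucose -- plasma glucose concentration
diastolic -- diastolic blood pressure
triceps -- triceps skin fold thickness
insulin -- 2-hour serum insulin
bmi -- body mass index
diabetes -- diabetes pedigree function
age -- age
test -- diabetes test result (0 = negative, 1 = positive)

Initial data analysis: pages 10-13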


Two unusual things:
• Max of pregnant = 17? Is that possible?
• diastolic (blood pressure) = 0? Are these missing values?

Data rearranged – recode the data:
• NA for missing values
• designate test as a factor with labels
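A minimal sketch of this recoding step. The pima data are not bundled with base R, so a small made-up data frame stands in for them here; the name `pima_toy` and all of its values are illustrative assumptions, not the real data.

```r
# Toy stand-in for the pima data; the values are invented for illustration.
pima_toy <- data.frame(
  diastolic = c(72, 0, 64, 0, 80),
  bmi       = c(33.6, 26.6, 0, 28.1, 43.1),
  test      = c(1, 0, 1, 0, 1)
)

# A zero is physically impossible for these measurements, so treat it as missing.
pima_toy$diastolic[pima_toy$diastolic == 0] <- NA
pima_toy$bmi[pima_toy$bmi == 0] <- NA

# Store the test result as a labelled factor instead of a 0/1 number.
pima_toy$test <- factor(pima_toy$test, levels = c(0, 1),
                        labels = c("negative", "positive"))

summary(pima_toy)
```

After the recoding, `summary()` reports NA counts separately and tabulates `test` by label, which makes the numerical summary easier to read.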

Graphical summary


1.2 When to use Regression Analysis – to model the relationship between Y and X1,…,Xp

Y --- response/output/dependent variable (continuous)
X1,…,Xp --- predictor/input/independent/explanatory variables

• p=1 --- simple regression
• p>1 --- multiple regression --- diastolic on bmi and diabetes
• more than one Y --- multivariate regression
• Analysis of Covariance --- diastolic on bmi and test
• Analysis of Variance (ANOVA) --- diastolic on test
• Logistic Regression --- test on diastolic and bmi


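Chapter 2 Estimation

Linear Model
$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon$
$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \varepsilon_i, \quad i = 1, 2, \ldots, n$

Matrix/vector representation
$y = X\beta + \varepsilon$
where $y = (y_1, \ldots, y_n)^T$, $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$, $\beta = (\beta_1, \ldots, \beta_p)^T$. In the matrix form X is n × p, so p counts all the coefficients in β.

Data = Systematic Structure + Random Variation
n dimensions = p dimensions + (n-p) dimensions

Estimation of β such that Xβ is close to y.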


Geometrical representation

[Figure: the response y, the space spanned by X, and the residual vector]


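Least squares estimation (LSE)

$\hat\beta = (X^T X)^{-1} X^T y$
$X\hat\beta = Hy$
$E(\hat\beta) = \beta, \quad \operatorname{Var}(\hat\beta) = (X^T X)^{-1}\sigma^2$

$H = X(X^T X)^{-1} X^T$ --- hat-matrix --- orthogonal projection of y onto M(X)

Predicted values: $\hat y = Hy = X\hat\beta$
Residuals: $\hat\varepsilon = y - X\hat\beta = (I - H)y$
Residual sum of squares: $\hat\varepsilon^T \hat\varepsilon = y^T (I - H) y$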
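The least squares quantities can be checked numerically. The sketch below uses simulated data (the coefficients 1, 2, -1 and the error sd are arbitrary assumptions) and compares the matrix formulas with `lm()`. Forming the inverse explicitly is done here only to mirror the formulas; `lm()` itself uses a QR decomposition.

```r
set.seed(1)
n  <- 30
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.5)

X        <- cbind(1, x1, x2)               # design matrix with an intercept column
XtX_inv  <- solve(t(X) %*% X)
beta_hat <- XtX_inv %*% t(X) %*% y         # (X'X)^{-1} X'y
H        <- X %*% XtX_inv %*% t(X)         # hat matrix
y_hat    <- H %*% y                        # predicted values
res      <- (diag(n) - H) %*% y            # residuals
rss      <- drop(t(y) %*% (diag(n) - H) %*% y)

fit <- lm(y ~ x1 + x2)                     # reference fit
cbind(by_hand = drop(beta_hat), lm = coef(fit))
```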


Is $\hat\beta$ a good estimate?
• Geometrically good – orthogonal projection
• MLE if the $\varepsilon_i$ are iid normal
• BLUE if $E(\varepsilon) = 0$ and $\operatorname{Var}(\varepsilon) = \sigma^2 I$ --- from the Gauss-Markov Theorem

Situations that call for estimators other than LSE:
• GLS if errors are correlated or have unequal variances
• Robust estimates if errors are long-tailed
• Biased estimates (e.g. ridge regression) if predictors are highly correlated (i.e. collinear)


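Goodness of fit

$R^2 = 1 - \frac{\sum_i (\hat y_i - y_i)^2}{\sum_i (y_i - \bar y)^2} = 1 - \frac{\text{RSS}}{\text{Total SS}}$ --- percentage of variance explained

Estimate of $\sigma^2$: $\hat\sigma^2 = \frac{\hat\varepsilon^T \hat\varepsilon}{n - p}$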


Example --- Gala (Page 23-25)
Variables:
Species -- number of species of tortoise
Endemics -- number of endemic species
Area -- area of the island (km²)
Elevation -- highest elevation of the island (m)
Nearest -- distance from the nearest island (km)
Scruz -- distance from Santa Cruz island (km)
Adjacent -- area of the adjacent island (km²)

gfit <- lm(Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala)

summary(gfit)


Chapter 3 Inference

General assumptions for linear models
$y \sim N(X\beta, \sigma^2 I)$
Thus
$\hat\beta = (X^T X)^{-1} X^T y \sim N(\beta, (X^T X)^{-1}\sigma^2)$

Hypothesis tests to compare models
Two models:
– Large model Ω, with q parameters (dimension)
– Small model ω, with p parameters, which consists of a subset of the predictors that are in Ω

Law of parsimony – use ω if the data support it! See the geometric illustration on page 27.


• Geometrical illustration:

[Figure: Y, the small model space and the large model space, with the residual for ω, the residual for Ω, and the difference between the two models]


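Test statistic

$\frac{\text{RSS}_\omega - \text{RSS}_\Omega}{\text{RSS}_\Omega}$,

or equivalently the likelihood-ratio statistic

$\frac{\max_{\beta,\sigma \in \Omega} L(\beta, \sigma \mid y)}{\max_{\beta,\sigma \in \omega} L(\beta, \sigma \mid y)}$.

From Cochran's theorem, we have

$F = \frac{(\text{RSS}_\omega - \text{RSS}_\Omega)/(q - p)}{\text{RSS}_\Omega/(n - q)} \sim F_{q-p,\,n-q}$.

Reject the null hypothesis (ω) if $F > F^{(\alpha)}_{q-p,\,n-q}$.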


Comment: The same test statistic applies not just when ω is a subset of Ω but also when ω is a subspace of Ω. This test is very widely used in regression and analysis of variance. When it is applied in different situations, the form of the test statistic may be re-expressed in various ways. The beauty of this approach is that you only need to know the general form. In any particular case, you just need to figure out which models represent the null and alternative hypotheses, fit them and compute the test statistic. It is very versatile.
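The "fit both models and compute the statistic" recipe can be sketched as a small helper. Assumptions: `f_test` is a hypothetical name introduced here, and R's built-in `LifeCycleSavings` data frame (columns sr, pop15, pop75, dpi, ddpi) is used as a stand-in for the savings data discussed in these slides.

```r
# General F-test for two nested linear models, written out from the RSS formula.
f_test <- function(small, large) {
  rss_s <- deviance(small)          # RSS of the small (null) model
  rss_l <- deviance(large)          # RSS of the large (alternative) model
  df_s  <- df.residual(small)       # n - p
  df_l  <- df.residual(large)       # n - q
  f <- ((rss_s - rss_l) / (df_s - df_l)) / (rss_l / df_l)
  p <- pf(f, df_s - df_l, df_l, lower.tail = FALSE)
  c(F = f, p.value = p)
}

g_large <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
g_small <- lm(sr ~ 1, data = LifeCycleSavings)    # null model: mean only

f_test(g_small, g_large)
anova(g_small, g_large)    # should report the same F statistic and p-value
```

With the mean-only model as ω this reproduces the overall F-test; any other pair of nested fits can be passed in the same way.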


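Some examples

1. Test of all predictors

Are any of the predictors useful?

Ω: y = Xβ + ε, X: n by p
ω: y = μ + ε – predict y by the mean
($H_0: \beta_1 = \cdots = \beta_{p-1} = 0$)

Now we find that
$\text{RSS}_\Omega = (y - X\hat\beta)^T (y - X\hat\beta) = \text{RSS}$, $df_\Omega = n - p$
$\text{RSS}_\omega = (y - \bar y)^T (y - \bar y) = \text{SYY}$, $df_\omega = n - 1$
where SYY is the sum of squares corrected for the mean.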


Comment:
• A failure to reject the null hypothesis is not the end of the game: you must still investigate the possibility of non-linear transformations of the variables and of outliers which may obscure the relationship. Even then, you may just have insufficient data to demonstrate a real effect, which is why we must be careful to say "fail to reject" the null rather than "accept" the null.

It would be a mistake to conclude that no real relationship exists.


• When the null is rejected, this does not imply that the alternative model is the best model. We don’t know whether all the predictors are required to predict the response or just some of them. Other predictors might also be added —for example quadratic terms in the existing predictors. Either way, the overall F-test is just the beginning of an analysis and not the end.

• Example – old economic dataset on page 29. Variables:
  dpi – per-capita disposable income
  ddpi – percent rate of change in dpi
  sr – aggregate personal saving divided by disposable income
  pop15, pop75 – percentage of the population under 15 and over 75


2. Testing just one predictor

Can one particular predictor be dropped?

• $\text{RSS}_\Omega$ – RSS for the model with all the predictors
• $\text{RSS}_\omega$ – RSS for the model with all the predictors except predictor i

Test statistic:
• F-statistic (with dfs 1 and n-p) or
• t-statistic with df n-p: $t_i = \hat\beta_i / se(\hat\beta_i)$

Example -- Savings (Page 30) --- Three methods:
• 1) F-statistic
• 2) t-statistic
• 3) compare two nested models using anova(g2, g)
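A short check that the t and F routes agree, sketched with R's built-in `LifeCycleSavings` data as an assumed stand-in for savings; dropping pop15 is just one illustrative choice of predictor.

```r
g  <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
g2 <- lm(sr ~ pop75 + dpi + ddpi, data = LifeCycleSavings)   # pop15 dropped

t_pop15 <- coef(summary(g))["pop15", "t value"]   # t-statistic from the full model
f_pop15 <- anova(g2, g)$F[2]                      # F-statistic from the nested comparison

c(t_squared = t_pop15^2, F = f_pop15)             # the two numbers should agree
```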


3. Testing a pair of predictors

Suppose we wish to test the significance of variables Xj and Xk. We might construct a table as shown just above and find that both variables have p-values greater than 0.05, indicating that individually neither is significant. Does this mean that both Xj and Xk can be eliminated from the model? Not necessarily.


Except in special circumstances, dropping one variable from a regression model causes the estimates of the other parameters to change. After dropping Xj, we might therefore find that a test of the significance of Xk shows that it should now be included in the model.

If you really want to check the joint significance of Xj and Xk, you should fit a model with and then without them and use the general F-test discussed above. Remember that even the result of this test may depend on what other predictors are in the model.


4. Testing a subspace
Can two predictors be replaced by their sum?
Null hypothesis: $H_0: \beta_j = \beta_k$ (a linear subspace)

Example – Savings
• 1) $H_0: \beta_{\text{pop15}} = \beta_{\text{pop75}}$

Null model:
$y = \beta_0 + \beta_{\text{pop15}}(\text{pop15} + \text{pop75}) + \beta_{\text{dpi}}\,\text{dpi} + \beta_{\text{ddpi}}\,\text{ddpi} + \varepsilon$

> g <- lm(sr ~ ., savings)
> gr <- lm(sr ~ I(pop15 + pop75) + dpi + ddpi, savings)
> anova(gr, g)
# The period . stands for all the other variables in the data frame
# The argument of I() is evaluated rather than interpreted as part of the model formula


• 2) $H_0: \beta_{\text{ddpi}} = 1$

Null model:
$y = \beta_0 + \beta_{\text{pop15}}\,\text{pop15} + \beta_{\text{pop75}}\,\text{pop75} + \beta_{\text{dpi}}\,\text{dpi} + \text{ddpi} + \varepsilon$

# The fixed term is called an offset
> gr <- lm(sr ~ pop15 + pop75 + dpi + offset(ddpi), savings)
> anova(gr, g)

Two other ways as usual:
t-statistic: $t = (\hat\beta - c)/se(\hat\beta)$, where c is the point hypothesis
F-statistic: square of the t-statistic
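The offset test and the t-statistic for a point hypothesis can be compared directly. This is a sketch under the assumption that R's built-in `LifeCycleSavings` data frame can stand in for savings; c = 1 is the hypothesised value of the ddpi coefficient.

```r
g  <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
gr <- lm(sr ~ pop15 + pop75 + dpi + offset(ddpi), data = LifeCycleSavings)

f_stat <- anova(gr, g)$F[2]                      # F-test of H0: beta_ddpi = 1

est    <- coef(summary(g))["ddpi", c("Estimate", "Std. Error")]
t_stat <- unname((est["Estimate"] - 1) / est["Std. Error"])   # (beta_hat - c) / se
p_val  <- 2 * pt(abs(t_stat), df.residual(g), lower.tail = FALSE)

c(t = t_stat, t_squared = t_stat^2, F = f_stat, p.value = p_val)
```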


Confidence Intervals for β

Confidence intervals are closely related to hypothesis tests: for a 100(1-α)% confidence region, any point that lies within the region represents a null hypothesis that would not be rejected at the 100α% level, while every point outside represents a null hypothesis that would be rejected.

The confidence region provides a lot more information than a single hypothesis test in that it tells us the outcome of a whole range of hypotheses about the parameter values.


Simultaneous confidence region for β
$(\hat\beta - \beta)^T X^T X (\hat\beta - \beta) \le p\,\hat\sigma^2 F^{(\alpha)}_{p,\,n-p}$

Confidence interval for one $\beta_i$
• General form: estimate ± critical value × s.e. of estimate
• For $\beta_i$: $\hat\beta_i \pm t^{(\alpha/2)}_{n-p}\,\hat\sigma\sqrt{(X^T X)^{-1}_{ii}}$
• It is better to consider the joint confidence region when plausible, especially when the estimates $\hat\beta_i$ are heavily correlated.
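The interval for a single coefficient can be computed from its formula and compared with `confint()`. A sketch, assuming R's built-in `LifeCycleSavings` data as a stand-in for savings:

```r
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

X         <- model.matrix(g)
sigma_hat <- summary(g)$sigma
se        <- sigma_hat * sqrt(diag(solve(t(X) %*% X)))   # sigma_hat * sqrt((X'X)^{-1}_ii)
tcrit     <- qt(0.975, df.residual(g))                   # t critical value for 95%

ci_manual <- cbind(lower = coef(g) - tcrit * se,
                   upper = coef(g) + tcrit * se)
ci_manual
confint(g)    # should match up to rounding error
```

An interval that contains zero corresponds to a predictor whose t-test is not significant at the 5% level, which is the duality between intervals and tests described above.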


Example – Savings
• 95% confidence interval for pop75
• 95% confidence interval for growth
• Joint 95% confidence region for pop75 and growth

Note: the "ellipse" package is needed for drawing confidence ellipses.


Confidence intervals for predictions

Q: Given a set of predictors $x_0$, what is the predicted response?
A: $\hat y_0 = x_0^T \hat\beta$

Two kinds of prediction:
• of the future mean response
• of future observations

The point estimates are the same, while the confidence intervals are respectively
• mean response: $\hat y_0 \pm t^{(\alpha/2)}_{n-p}\,\hat\sigma\sqrt{x_0^T (X^T X)^{-1} x_0}$
• future observation: $\hat y_0 \pm t^{(\alpha/2)}_{n-p}\,\hat\sigma\sqrt{1 + x_0^T (X^T X)^{-1} x_0}$
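Both intervals are available from `predict()`. The sketch below assumes R's built-in `LifeCycleSavings` data as a stand-in for savings, and the new case `x0` is a made-up set of predictor values chosen only for illustration.

```r
g  <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)
x0 <- data.frame(pop15 = 35, pop75 = 2, dpi = 1000, ddpi = 4)   # hypothetical new case

ci_mean <- predict(g, x0, interval = "confidence")   # future mean response
ci_obs  <- predict(g, x0, interval = "prediction")   # single future observation

out <- rbind(ci_mean, ci_obs)
rownames(out) <- c("mean response", "future observation")
out
```

The fitted value is identical in both rows; only the interval for a future observation carries the extra "1 +" term and is therefore wider.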


Orthogonality (page 41)
Identifiability (page 44)
What can go wrong? (page 46)
• Source and quality of the data
• Error component
• Structural component

Confidence in the conclusions from a model declines as we progress through these.

Most statistical theory rests on the assumption that the model is correct. In practice, the best one can hope for is that the model is a fair representation of reality. A model can be no more than a good portrait.


All models are wrong but some are useful.
George Box

So far as theories of mathematics are about reality, they are not certain; so far as they are certain, they are not about reality.
Einstein


Prediction and Extrapolation (Page 52)
• Is the new x0 within the range of validity of the model?
• Is it close to the range of the original data?
• If not,
  * the prediction may be unrealistic and
  * confidence intervals for prediction get wider as we move away from the data.

Example
• Model:
> g4 <- lm(sr ~ pop75, data=savings)
> summary(g4)


• Prediction/fit and confidence interval (CI)
> grid <- seq(0, 10, 0.1)
> p <- predict(g4, data.frame(pop75=grid), se=T)
> cv <- qt(0.975, 48)
> matplot(grid, cbind(p$fit, p$fit-cv*p$se, p$fit+cv*p$se),
+   lty=c(1, 2, 2), type="l", xlab="pop75", ylab="Savings")
> rug(savings$pop75)   # show the location of the pop75 values

Conclusion (See Figures 3.6 and 3.7): the widening of the CI does not reflect the possibility that the structure of the model itself may change as we move into new territory.

Prediction is a tricky business. Perhaps the only thing worse than a prediction is no prediction at all. (a quote from the 4th century)


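Chapter 4 Errors in Predictors

The regression model Y = Xβ + ε allows for Y being measured with error by having the ε term.

What if X is measured with error (that is, the observed X is not the X used to generate Y)?

Consider $(x_i, y_i)$, $i = 1, 2, \ldots, n$:
$y_i = \eta_i + \varepsilon_i$
$x_i = \xi_i + \delta_i$
where ε and δ are independent, $E\varepsilon_i = E\delta_i = 0$, $\operatorname{var}\varepsilon_i = \sigma^2$, $\operatorname{var}\delta_i = \sigma_\delta^2$, $\sigma_\xi^2 = \sum_i (\xi_i - \bar\xi)^2 / n$, and $\sigma_{\xi\delta} = \operatorname{cov}(\xi, \delta)$.

True underlying relationship:
$\eta_i = \beta_0 + \beta_1 \xi_i$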


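Then
$\hat\beta_1 = \frac{\sum_i (x_i - \bar x)\, y_i}{\sum_i (x_i - \bar x)^2}$
$E\hat\beta_1 = \beta_1 \frac{\sigma_\xi^2 + \sigma_{\xi\delta}}{\sigma_\xi^2 + \sigma_\delta^2 + 2\sigma_{\xi\delta}}$, which for $\operatorname{cov}(\xi, \delta) = 0$ reduces to $E\hat\beta_1 = \beta_1 \frac{1}{1 + \sigma_\delta^2/\sigma_\xi^2}$.

Thus the estimate of $\beta_1$ will be biased! If the variability in the errors of observation of X ($\sigma_\delta^2$) is small relative to the range of X ($\sigma_\xi^2$), we need not be concerned.

If not, it's a serious problem. Other methods should be considered!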


Note the difference between:
• the errors-in-predictors case
• treating X as a random variable (for observational data, assume that Y is generated conditional on the fixed value of X)

Example (page 56)
> x <- 10*runif(50)
> y <- x + rnorm(50)
> gx <- lm(y ~ x)
The regression coefficients are $\beta_0 = 0$ and $\beta_1 = 1$. See how they change when the noise 5*rnorm(50) is added to x: the slope decreases from 1 to 0.25 theoretically, and from 0.97 to 0.435 numerically.

As the variance of a U(0,1) variable is 1/12, $\sigma_\delta^2 = 25$ and $\sigma_\xi^2 = 100/12$, so the expected value of $\hat\beta_1$ is $1/(1 + 25/(100/12)) = 0.25$ (see the slide above and the simulation on P57).
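The attenuation of the slope can be seen by repeating the experiment many times. A sketch: the seed and the number of replications (1000) are arbitrary choices made here, not values taken from the slides.

```r
set.seed(123)
slopes <- replicate(1000, {
  xi <- 10 * runif(50)         # true predictor
  y  <- xi + rnorm(50)         # beta0 = 0, beta1 = 1
  x  <- xi + 5 * rnorm(50)     # observed predictor, measured with error
  coef(lm(y ~ x))[2]           # slope estimated from the noisy predictor
})

mean(slopes)                   # average estimated slope over the replications
1 / (1 + 25 / (100 / 12))      # theoretical value: 0.25
```

The average of the simulated slopes should sit close to the theoretical 0.25 rather than the true value 1.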


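Chapter 5 Generalized Least Squares

The general case
Error: $\operatorname{var}\varepsilon = \sigma^2\Sigma$, where $\sigma^2$ is unknown and Σ is known.

GLS minimizes $(y - X\beta)^T \Sigma^{-1} (y - X\beta)$:
$\hat\beta = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y$

GLS = OLS applied to the regression of $S^{-1}y$ on $S^{-1}X$, where S is the triangular matrix from the Cholesky decomposition of Σ: $\Sigma = SS^T$. ($\operatorname{var} S^{-1}\varepsilon = \sigma^2 I$)

Example: Longley's regression data (page 59)
• Errors: $\varepsilon_{i+1} = \rho\,\varepsilon_i + \delta_i$, $\delta_i \sim N(0, \tau^2)$
• $\Sigma = (\Sigma_{ij})$: $\Sigma_{ij} = \rho^{|i-j|}$
• Two methods: 1) use the formula 2) Cholesky decomposition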
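The two methods can be sketched side by side on R's built-in `longley` data. Assumptions: the model `Employed ~ GNP + Population` and the value rho = 0.3 are illustrative choices made here; in practice rho would have to be estimated.

```r
data(longley)
X <- model.matrix(Employed ~ GNP + Population, data = longley)
y <- longley$Employed
n <- length(y)

rho   <- 0.3                                   # assumed AR(1) correlation
Sigma <- rho^abs(outer(1:n, 1:n, "-"))         # Sigma_ij = rho^|i-j|

# Method 1: the GLS formula
Sigma_inv <- solve(Sigma)
beta_gls  <- solve(t(X) %*% Sigma_inv %*% X, t(X) %*% Sigma_inv %*% y)

# Method 2: Cholesky factor S with Sigma = S S^T, then OLS on the transformed data
S  <- t(chol(Sigma))                           # chol() returns the upper factor, so transpose
Xs <- solve(S, X)                              # S^{-1} X
ys <- solve(S, y)                              # S^{-1} y
beta_chol <- coef(lm(ys ~ Xs - 1))             # no extra intercept: Xs already has one

cbind(formula = drop(beta_gls), cholesky = beta_chol)
```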


Revisit the example using the nlme library (Page 61)

Weighted Least Squares
$\Sigma = \operatorname{diag}(1/w_1, \ldots, 1/w_n)$ – diagonal: the errors are uncorrelated but have unequal variances
weights: $w_i$, i=1,2,…,n. Low variability → high weight; high variability → low weight.
Regress $\sqrt{w_i}\,y_i$ on $\sqrt{w_i}\,x_i$ using OLS.

Example -- strongx:
• An experiment was designed to study the interaction of certain kinds of elementary particles on collision with proton targets.


• The cross-section (crossx) variable is believed to be linearly related to the inverse of the energy (the energy variable has already been inverted). At each level of the momentum, a very large number of observations were taken, so that the standard deviation of the response (sd) could be estimated accurately.

• (Figure 5.1) Does the unweighted fit look better? For lower values of energy, the variance in the response is smaller, so the weight is high. The weighted fit therefore tries to catch these points better than the others.
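A sketch of weighted least squares on simulated data, since strongx is not part of base R. The known standard deviations `sd_i` and the true coefficients are assumptions made for illustration; the weights are the inverse variances.

```r
set.seed(2)
n    <- 40
x    <- seq(1, 10, length.out = n)
sd_i <- 0.2 * x                          # known sd of each response (hypothetical)
y    <- 3 + 2 * x + rnorm(n, sd = sd_i)
w    <- 1 / sd_i^2                       # weights = inverse variances

g_w   <- lm(y ~ x, weights = w)          # weighted fit
g_ols <- lm(y ~ x)                       # unweighted fit, for comparison

# The same weighted fit by hand: OLS of sqrt(w)*y on sqrt(w)*(1, x), no extra intercept
g_hand <- lm(I(sqrt(w) * y) ~ 0 + sqrt(w) + I(sqrt(w) * x))

rbind(weighted = coef(g_w), by_hand = coef(g_hand), unweighted = coef(g_ols))
```

The first two rows agree exactly; the unweighted fit differs because it gives the noisy high-x points as much say as the precise low-x points.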


Iteratively Reweighted Least Squares (IRWLS), p64
• Var(ε) is not completely known.
• Model Σ (e.g. $\operatorname{var}\varepsilon_i = \gamma_0 + \gamma_1 x_1$) and then estimate γ using OLS. Iterate until convergence.
• Use gls() in the nlme library – model the variance and jointly estimate the regression and weighting parameters using a likelihood-based method.
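One possible way to write the iteration, as a rough sketch on simulated data. The variance model var = γ0 + γ1·x, the true parameter values, the cap of 20 iterations and the floor on the fitted variances are all assumptions made here for illustration.

```r
set.seed(3)
n <- 200
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = sqrt(0.5 + 0.8 * x))   # var = gamma0 + gamma1 * x

w <- rep(1, n)                           # start from ordinary least squares
for (iter in 1:20) {
  fit   <- lm(y ~ x, weights = w)        # estimate beta with the current weights
  vfit  <- lm(residuals(fit)^2 ~ x)      # estimate gamma: squared residuals on x
  v_hat <- pmax(fitted(vfit), 1e-6)      # guard against non-positive variance estimates
  w_new <- 1 / v_hat                     # new weights = 1 / estimated variance
  if (max(abs(w_new - w)) < 1e-8) break  # stop once the weights settle
  w <- w_new
}

coef(fit)     # regression coefficients
coef(vfit)    # estimated variance parameters (gamma0, gamma1)
```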


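Chapter 6 Testing for Lack of Fit

Estimate of $\sigma^2$
• If the model is correct, then $\hat\sigma^2$ should be an unbiased estimate of $\sigma^2$.
• If the model is too simple and not correct, then $\hat\sigma^2$ will overestimate $\sigma^2$.
• If the model is too complex and overfits the data, then $\hat\sigma^2$ will underestimate $\sigma^2$.

$\sigma^2$ known
Test statistic: $\frac{(n - p)\,\hat\sigma^2}{\sigma^2} \sim \chi^2_{n-p}$
Lack of fit if $\frac{(n - p)\,\hat\sigma^2}{\sigma^2} > \chi^{2\,(1-\alpha)}_{n-p}$

Example – strongx
• Model 1: without a quadratic term → lack of fit (p-value=0.0048)
• Model 2: with a quadratic term → good fit (p-value=0.85363)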
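A sketch of the chi-square lack-of-fit test when σ² is known. It uses simulated data, with an error sd of 0.5 that is treated as known; `lack_of_fit_p` is a hypothetical helper name introduced here.

```r
set.seed(4)
n     <- 50
x     <- runif(n, 0, 3)
sigma <- 0.5                                     # error sd, assumed known
y     <- 1 + x + 0.8 * x^2 + rnorm(n, sd = sigma)   # the truth is quadratic

# p-value of the test: (n - p) * sigma_hat^2 / sigma^2 = RSS / sigma^2 ~ chi^2_{n-p}
lack_of_fit_p <- function(fit, sigma) {
  stat <- deviance(fit) / sigma^2
  pchisq(stat, df.residual(fit), lower.tail = FALSE)
}

p_line <- lack_of_fit_p(lm(y ~ x), sigma)            # straight line: expect a small p-value
p_quad <- lack_of_fit_p(lm(y ~ x + I(x^2)), sigma)   # quadratic: matches the truth
c(line = p_line, quadratic = p_quad)
```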


$\sigma^2$ unknown
• The $\hat\sigma^2$ based on the regression model needs to be compared to some model-free estimate.
• Repeat y independently for one or more x.
• To reflect both the within-subject variability and the between-subject variability, we use the "pure error" estimate of $\sigma^2$, given by $SS_{pe}/df_{pe}$, where

$SS_{pe} = \sum_{\text{distinct } x}\ \sum_{\text{given } x} (y_i - \bar y)^2$
$df_{pe} = \sum_{\text{distinct } x} (\#\text{replicates} - 1)$


If you fit a model that assigns one parameter to each group of observations with fixed x, then the $\hat\sigma^2$ from this model will be the pure error $\hat\sigma^2$. Use factor(x) instead of x directly! See the example below.

Comparing this model to the regression model amounts to the lack of fit test -- the most convenient way.

Alternative (ANOVA): Partition RSS into
• that due to lack of fit
• that due to the pure error

              df          SS          MS                       F
Residual      n-p         RSS
Lack of Fit   n-p-dfpe    RSS-SSpe    (RSS-SSpe)/(n-p-dfpe)    Ratio of MS
Pure Error    dfpe        SSpe        SSpe/dfpe


Lack of fit if the F-statistic $> F^{(1-\alpha)}_{n-p-df_{pe},\,df_{pe}}$

Example – corrosion:
Thirteen specimens of 90/10 Cu-Ni alloys with varying iron content in percent. The specimens were submerged in sea water for 60 days and the weight loss due to corrosion was recorded in units of milligrams per square decimeter per day.

• Model 1:
> g <- lm(loss ~ Fe, data=corrosion)
Both R²=97% and the graph (Figure 6.3) show a good fit to the data.

• Model 2: (Reserve a parameter for each group of data with the same x. The fitted values are the means in each group, see Figure 6.3.)
> ga <- lm(loss ~ factor(Fe), data=corrosion)


Compare the two models:
> anova(g, ga)
• p-value = 0.0086 → a lack of fit!
• Reason: regression sd = 3.06, pure error sd = sqrt(11.8/6) = 1.4.
  (Are the replicates genuine? Is the pure error sd too low? This may be caused by some correlation in the measurements or by an unmeasured third variable.)

• Another model other than a straight line (though not obvious!)
> gp <- lm(loss ~ Fe+I(Fe^2)+I(Fe^3)+I(Fe^4)+I(Fe^5)+I(Fe^6),
+   corrosion)
> summary(gp)$r.squared
[1] 0.99653

• R²=99.7%
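The pure-error calculation and the factor(x) shortcut can be compared on simulated replicated data (the corrosion data are not in base R). The design with 5 distinct x values and 3 replicates each, and the curved true relationship, are assumptions made for this sketch.

```r
set.seed(5)
x <- rep(c(1, 2, 3, 4, 5), each = 3)           # 5 distinct x values, 3 replicates each
y <- 10 - 2 * x + 0.6 * (x - 3)^2 + rnorm(length(x), sd = 0.5)

g  <- lm(y ~ x)             # straight-line model
ga <- lm(y ~ factor(x))     # one parameter per distinct x: fitted values are group means

# Pure-error sum of squares and degrees of freedom, computed directly
ss_pe <- sum(tapply(y, x, function(v) sum((v - mean(v))^2)))
df_pe <- sum(tapply(y, x, length) - 1)

# Lack-of-fit F statistic from the ANOVA partition of RSS
f_lof <- ((deviance(g) - ss_pe) / (df.residual(g) - df_pe)) / (ss_pe / df_pe)

c(SSpe = ss_pe, dfpe = df_pe, F = f_lof)
anova(g, ga)                # reports the same F statistic
```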


• The fit of this model is excellent (see Figure 6.4) but it is clearly ridiculous!
• This is a consequence of OVERFITTING the data.

About R² and goodness of fit
• There is no need to become too focused on measures of fit like R².
• If a lack-of-fit test fails to reject the null hypothesis, we cannot conclude that we have the true model. After all, it may be that we just did not have enough data to detect the inadequacies of the model. All we can say is that the model is not contradicted by the data.

• It is also possible to detect lack of fit by less formal, graphical methods.


• A more general question is how good a fit do you really want?

• By increasing the complexity of the model, it is possible to fit the data more closely. By using as many parameters as data points, we can fit the data exactly.

• Very little is achieved by doing this since we learn nothing beyond the data itself and any predictions made using such a model will tend to have very high variance.

• The question of how complex a model to fit is difficult and fundamental.


Chapter 7 Diagnostics

Regression model building is often an iterative and interactive process. The first model we try may prove to be inadequate.

Regression diagnostics are used to detect problems with the model and suggest improvements.

This is a hands-on process.


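Residuals and Leverage

H – hat-matrix: $H = X(X^T X)^{-1} X^T$, thus
$\hat y = Hy$
$\hat\varepsilon = y - \hat y = (I - H)y = (I - H)X\beta + (I - H)\varepsilon = (I - H)\varepsilon$
$\operatorname{var}\hat\varepsilon = (I - H)(\operatorname{var}\varepsilon)(I - H)^T = (I - H)\sigma^2$ (using $\operatorname{var}\varepsilon = \sigma^2 I$)

$h_i = H_{ii}$ -- leverages
$\operatorname{var}\hat\varepsilon_i = \sigma^2(1 - h_i)$: a large leverage $h_i$ will "force" the fit to be close to $y_i$.

Some facts: $\sum_i h_i = p$, $h_i \ge 1/n$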


"Rule of thumb" – leverages of more than 2p/n should be looked at more closely. Large values of $h_i$ are due to extreme values in X. ($h_i$ corresponds to a Mahalanobis distance defined by X.) Also notice that $\widehat{\operatorname{var}}\,\hat y_i = h_i \hat\sigma^2$.

Example – savings (page 72)
• Index plot of residuals → Chile and Zambia have the most extreme (largest or smallest) residuals.
• Index plot of leverages → Libya and United States have the highest leverage.
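The leverage facts and the 2p/n rule of thumb can be checked with `hatvalues()`. A sketch, assuming R's built-in `LifeCycleSavings` data as a stand-in for savings:

```r
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

h <- hatvalues(g)                # diagonal of the hat matrix
p <- length(coef(g))
n <- nobs(g)

sum(h)                           # equals p
min(h) >= 1 / n                  # every leverage is at least 1/n
h[h > 2 * p / n]                 # cases flagged by the 2p/n rule of thumb
```

An index plot, for example `plot(h, type = "h")`, shows the same information graphically.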


Studentized Residuals

Since $\operatorname{var}\hat\varepsilon_i = \sigma^2(1 - h_i)$, the (internally) studentized residuals are
$r_i = \frac{\hat\varepsilon_i}{\hat\sigma\sqrt{1 - h_i}}$

If the model assumptions are correct, then var $r_i$ = 1 and corr($r_i$, $r_j$) is small.

Thus studentization can correct for the non-constant variance in residuals when the errors have constant variance, but cannot correct for it when there is some heteroscedasticity in the errors.

Example – savings (page 75)


An Outlier Test

An outlier is a point that does not fit the current model. Outliers may affect the fit. See Figure 7.2.

An outlier test enables us to distinguish between truly unusual points and residuals which are large but not exceptional.

Exclude point i and recompute the estimates to get $\hat\beta_{(i)}$ and $\hat\sigma^2_{(i)}$, and then
$\hat y_{(i)} = x_i^T \hat\beta_{(i)}$

If $\hat y_{(i)} - y_i$ is large then point i is an outlier.


Test statistic: jackknife (or externally studentized / cross-validated) residuals
$t_i = \frac{\hat\varepsilon_i}{\hat\sigma_{(i)}\sqrt{1 - h_i}} = r_i \left(\frac{n - p - 1}{n - p - r_i^2}\right)^{1/2} \sim t_{n-p-1}$

When all cases are to be tested we must adjust the level of the test using the Bonferroni correction. However, it is conservative!

Example – savings (page 76)

Some notes about outliers and their treatment, with an example: Star (pages 76-78)
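Both kinds of studentized residual can be computed from their formulas and checked against `rstandard()` and `rstudent()`. A sketch, assuming R's built-in `LifeCycleSavings` data as a stand-in for savings and a two-sided overall level of 0.05:

```r
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

e <- residuals(g)
h <- hatvalues(g)
n <- nobs(g)
p <- length(coef(g))
s <- summary(g)$sigma

r  <- e / (s * sqrt(1 - h))                      # internally studentized residuals
tj <- r * sqrt((n - p - 1) / (n - p - r^2))      # jackknife residuals

# Bonferroni: test each of the n cases at level alpha / n (two-sided)
alpha <- 0.05
crit  <- qt(1 - alpha / (2 * n), df = n - p - 1)

tj[which.max(abs(tj))]                           # the most extreme case
crit                                             # Bonferroni critical value
tj[abs(tj) > crit]                               # cases declared outliers, if any
```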


Influential Observations

An influential point is one whose removal from the dataset would cause a large change in the fit.

An influential point may or may not be an outlier and may or may not have large leverage, but it will tend to have at least one of those two properties.

Measures of influence
• Change in the coefficients: $\hat\beta - \hat\beta_{(i)}$
• Change in the fit: $X(\hat\beta - \hat\beta_{(i)}) = \hat y - \hat y_{(i)}$


• Cook statistics – popular now
$D_i = \frac{(\hat\beta - \hat\beta_{(i)})^T (X^T X)(\hat\beta - \hat\beta_{(i)})}{p\,\hat\sigma^2} = \frac{1}{p}\, r_i^2\, \frac{h_i}{1 - h_i}$

• The first term, $r_i^2$, is the residual effect and the second is the leverage. The combination leads to the influence.
• An index plot of $D_i$ can be used to identify influential points.

Example – savings (page 79)
• Full data
• Exclude one point -- lm.influence() is helpful here
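The Cook statistic can be computed from the residual and leverage parts and compared with `cooks.distance()`. A sketch, assuming R's built-in `LifeCycleSavings` data as a stand-in for savings:

```r
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

h <- hatvalues(g)
r <- rstandard(g)                      # internally studentized residuals
p <- length(coef(g))

d <- r^2 * h / (p * (1 - h))           # Cook statistic: residual effect times leverage
head(sort(d, decreasing = TRUE), 3)    # the three most influential cases

# Refit without the most influential case and compare the coefficients
worst  <- names(which.max(d))
g_drop <- lm(sr ~ pop15 + pop75 + dpi + ddpi,
             data = LifeCycleSavings[rownames(LifeCycleSavings) != worst, ])
cbind(full = coef(g), without_worst = coef(g_drop))
```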


Residual Plots

Residuals vs. fitted – to look for
• heteroscedasticity (non-constant variance)
• nonlinearity

Residuals or abs(residuals) vs. one of the predictors $x_i$ – look for the same things; for a predictor not yet in the model, look for any relationship suggesting that $x_i$ should be included.

Example 1 – savings (pages 81-83, Figure 7.6)
The scatter plot of residuals vs. the predictor pop15 shows non-constant variance in two groups.

Example 2 – simulation example (page 83)


Non-Constant Variance

Two approaches to dealing with non-constant variance:
1) weighted least squares
2) variance-stabilizing transformation y → h(y): $h(y) = \int \frac{dy}{\sqrt{\operatorname{var} y}}$
• var y ∝ (Ey)² → h(y) = log y
• var y ∝ Ey → h(y) = √y – appropriate for count response data

Note: Use graphical techniques (e.g. residual plots) to detect problems/structures of an unsuspected nature.

Example – Galapagos (page 84)
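The square-root rule for counts can be illustrated by simulation. A sketch, where the Poisson means 20, 50 and 100 and the sample size are arbitrary choices made here: the raw variance grows with the mean, while the variance of √y stays roughly constant near 1/4.

```r
set.seed(6)
lambda <- c(20, 50, 100)                 # Poisson means: var y = E y

sim <- sapply(lambda, function(l) {
  y <- rpois(1e5, l)
  c(var_y = var(y), var_sqrt_y = var(sqrt(y)))
})
colnames(sim) <- paste0("mean=", lambda)

round(sim, 3)    # first row grows with the mean, second row stays near 0.25
```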


Non-Linearity

How to check if the systematic part Ey = Xβ is correct? As before, we may look at
• plots of residuals against fitted values and predictors $x_i$
• plots of y against each $x_i$

What about the effect of the other x on the y vs. $x_i$ plot?

Partial regression plots (or added variable plots) can isolate the effect of $x_i$ on y:
• Regress y on all x except $x_i$, get residuals $\hat\delta$;
• Regress $x_i$ on all x except $x_i$, get residuals $\hat\gamma$;
• Plot $\hat\delta$ against $\hat\gamma$.


Partial residual plots: plot $\hat\varepsilon + \hat\beta_i x_i$ against $x_i$.

The slope of a line fitted to either plot is $\hat\beta_i$.

Comparison: Partial residual plots are reckoned to be better for non-linearity detection, while added variable plots are better for outlier/influential detection.

Example – savings (pages 85-87)
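Both constructions, and the claim that the fitted slope equals the coefficient from the full model, can be checked numerically. A sketch for the predictor pop15, assuming R's built-in `LifeCycleSavings` data as a stand-in for savings; the plotting calls are left commented out so the block runs without a graphics device.

```r
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings)

# Partial regression (added variable) plot for pop15
d   <- residuals(lm(sr ~ pop75 + dpi + ddpi, data = LifeCycleSavings))
gam <- residuals(lm(pop15 ~ pop75 + dpi + ddpi, data = LifeCycleSavings))
# plot(gam, d, xlab = "pop15 residuals", ylab = "sr residuals")
slope_av <- unname(coef(lm(d ~ gam))[2])

# Partial residual plot for pop15
x15 <- LifeCycleSavings$pop15
pr  <- residuals(g) + coef(g)["pop15"] * x15
# plot(x15, pr, xlab = "pop15", ylab = "partial residuals")
slope_pr <- unname(coef(lm(pr ~ x15))[2])

c(full_model = unname(coef(g)["pop15"]),
  added_variable = slope_av, partial_residual = slope_pr)
```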


Assessing Normality
• Use a Q-Q plot
• Example 1 – savings (page 88, see Figure 7.9)
• Example 2 – simulated data from different distributions (page 88, see Figure 7.10)
• Ways to treat non-normality (page 90)

Half-normal plots
• Designed for the assessment of positive data
• Useful for leverages or Cook statistics – look for outliers in the model
• Note: no need to look for a straight-line relationship
• Example – savings (page 91)


Correlated Errors
• Plot residuals against time
• Use formal tests like the Durbin-Watson or the runs test. (If errors are correlated, you can use GLS.)

Example – airquality (pages 93-94)
• Variables: ozone, solar radiation, temperature, wind speed
• Non-transformed response → non-constant variance and nonlinearity
• Log-transformed response
• Check for serial correlation
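A rough sketch of a check for serial correlation using R's built-in `airquality` data. Assumptions: the log-ozone model below is one plausible choice, and the Durbin-Watson-type ratio is computed by hand over the pairs of consecutive days that are both observed, so it is only an approximation to the statistic a dedicated test function would report.

```r
g <- lm(log(Ozone) ~ Solar.R + Wind + Temp, data = airquality,
        na.action = na.exclude)      # keep NA placeholders so the time order is preserved
e <- residuals(g)                    # one entry per day, NA where a row was incomplete
n <- length(e)

# Regress each residual on its predecessor; a clearly non-zero slope suggests serial correlation
lag_fit <- lm(e[-1] ~ e[-n])
coef(summary(lag_fit))

# Durbin-Watson-type ratio; values near 2 indicate little first-order correlation
dw <- sum(diff(e)^2, na.rm = TRUE) / sum(e^2, na.rm = TRUE)
dw
```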


Chapter 8 Transformation

Transformations of the response and predictors can improve the fit and correct violations of model assumptions such as constant error variance.

We may also consider adding additional predictors that are functions of the existing predictors like quadratic or crossproduct terms.


Transforming the response

Example – log response
$\log y = \beta_0 + \beta_1 x + \varepsilon$ (linear regression)

In the original scale:
$y = \exp(\beta_0 + \beta_1 x)\cdot\exp(\varepsilon)$
The errors enter multiplicatively in the original scale.

Compare:
$y = \exp(\beta_0 + \beta_1 x) + \varepsilon$ (non-linear regression)
The errors enter additively in the original scale.

Usual approach: try different transformations and check the residuals.


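Box-Cox method/transformation (for y>0)
$y \to t_\lambda(y)$, where $t_\lambda(y) = \frac{y^\lambda - 1}{\lambda}$ for $\lambda \ne 0$ and $t_\lambda(y) = \log y$ for $\lambda = 0$

• Choose λ using maximum likelihood.
• Profile log-likelihood assuming normal errors:
$L(\lambda) = -\frac{n}{2}\log(\text{RSS}_\lambda/n) + (\lambda - 1)\sum_i \log y_i$
where $\text{RSS}_\lambda$ is the RSS when $t_\lambda(y)$ is the response.
• Usually L(λ) is maximized over {-2,-1,-1/2,0,1/2,1,2}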

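Transforming the response can make the model harder to interpret, so we don't want to do it unless it's really necessary. One way to check this is to form a confidence interval for $\lambda$. A $100(1-\alpha)\%$ confidence interval for $\lambda$ is

$$\left\{\lambda : L(\lambda) > L(\hat\lambda) - \tfrac{1}{2}\chi^2_1(1-\alpha)\right\}$$

where $\chi^2_1(1-\alpha)$ is the $1-\alpha$ quantile of the chi-squared distribution with one degree of freedom.

Example 1 – savings (page 96): CI($\lambda$) = (0.6, 1.4), so there is no need to transform.

Example 2 – gala: CI($\lambda$) = (0.1, 0.5), suggesting a cube-root transformation.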
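The interval for $\lambda$ can be read off the profile log-likelihood numerically. The sketch below assumes that the built-in `LifeCycleSavings` data frame corresponds to the savings example and uses `boxcox()` from the MASS package; the grid range 0.5 to 1.5 is an arbitrary choice for illustration.

```r
library(MASS)  # provides boxcox()

# LifeCycleSavings is assumed here to match the "savings" data of the slides
g <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = LifeCycleSavings, y = TRUE)

# Profile log-likelihood on a fine grid of lambda values
bc <- boxcox(g, lambda = seq(0.5, 1.5, by = 0.01), plotit = FALSE)

# Grid value with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]

# 95% interval: all lambda whose log-likelihood lies within
# qchisq(0.95, 1) / 2 of the maximum
inside <- bc$y > max(bc$y) - qchisq(0.95, 1) / 2
ci <- range(bc$x[inside])
```

If the interval `ci` contains 1, the data give no strong reason to transform the response.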

Other transformations
• Logit transformation $\log(y/(1-y))$ for proportions
• Fisher's Z transformation $0.5\log((1+y)/(1-y))$ for correlations

Transforming the predictors

Segmented regression (broken stick regression)
Example – savings (pages 98-100)
• Method 1: Subset regression fit; discontinuous at the knotpoint.
• Method 2: Fit with hockey stick functions; continuous at the knotpoint.

Polynomials and orthogonal polynomials: add powers of the predictor(s) until the newly added term is no longer statistically significant.
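Both approaches to transforming a predictor can be sketched in a few lines. The example assumes the built-in `LifeCycleSavings` data as a stand-in for the savings data; the knot at `pop15 = 35` and the helper names `lhs` and `rhs` are assumptions made for illustration.

```r
# Hockey-stick basis functions with a knot at 35 (the knot location is an assumption)
lhs <- function(x) ifelse(x < 35, 35 - x, 0)
rhs <- function(x) ifelse(x < 35, 0, x - 35)

# Method 2: one fit, continuous at the knot
gb <- lm(sr ~ lhs(pop15) + rhs(pop15), data = LifeCycleSavings)

# Method 1: separate subset fits, generally discontinuous at the knot
g1 <- lm(sr ~ pop15, data = LifeCycleSavings, subset = pop15 < 35)
g2 <- lm(sr ~ pop15, data = LifeCycleSavings, subset = pop15 >= 35)

# Predictions just left and right of the knot from the hockey-stick fit
near_knot <- data.frame(pop15 = c(35 - 1e-6, 35 + 1e-6))
p_stick <- predict(gb, near_knot)

# Polynomial terms: add a power and test whether the new term is significant
gp2 <- lm(sr ~ poly(ddpi, 2), data = LifeCycleSavings)
gp3 <- lm(sr ~ poly(ddpi, 3), data = LifeCycleSavings)
p_cubic <- anova(gp2, gp3)[2, "Pr(>F)"]
```

A large `p_cubic` would suggest stopping at the quadratic term.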

Regression Splines

Polynomials have the advantage of smoothness but the disadvantage that each data point affects the fit globally. This is because the power functions used for the polynomials take non-zero values across the whole range of the predictor.

In contrast, the broken stick regression method localizes the influence of each data point to its particular segment which is good but we do not have the same smoothness as with the polynomials.

There is a way we can combine the beneficial aspects of both these methods — smoothness and local influence — by using B-spline basis functions.

Cubic B-spline basis functions:
• A given basis function is non-zero on an interval spanning four successive knot intervals (five successive knots) and zero elsewhere. This property ensures the local influence property.
• The basis function is a cubic polynomial on each sub-interval between successive knots.
• The basis function is continuous, and continuous in its first and second derivatives, at each knot point. This property ensures the smoothness of the fit.
• The basis function integrates to one over its support.

A constructed example: $y = \sin^2(2\pi x^3) + \varepsilon$, $\varepsilon \sim N(0, 0.1^2)$
1) Orthogonal polynomial regression
2) Regression splines
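The constructed example can be reproduced approximately as follows. This is a sketch under assumptions: the seed, the polynomial degree 4 and the basis size 12 are arbitrary, and `bs()` from the splines package normalizes its basis so that the functions sum to one at each point rather than integrating to one.

```r
library(splines)  # part of the standard R distribution

set.seed(1)  # arbitrary seed, for reproducibility only
funky <- function(x) sin(2 * pi * x^3)^2
x <- seq(0, 1, by = 0.01)
y <- funky(x) + 0.1 * rnorm(length(x))

# 1) Orthogonal polynomial regression (degree 4 chosen for illustration)
g_poly <- lm(y ~ poly(x, 4))

# 2) Regression spline: cubic B-spline basis with 12 columns (an assumed size)
B <- bs(x, df = 12, intercept = TRUE)
g_spline <- lm(y ~ B - 1)

# Local influence: at any x, at most four cubic basis functions are non-zero
max_active <- max(rowSums(B != 0))

# Residual sums of squares of the two fits
rss <- c(poly = deviance(g_poly), spline = deviance(g_spline))
```

Plotting `fitted(g_poly)` and `fitted(g_spline)` against `x`, together with `funky(x)`, shows how each fit tracks the increasingly rapid oscillation near `x = 1`.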

Chapter 9 Scale Changes, Principal Components and Collinearity

Topics:
• Changes of Scale (page 106)
• Principal Components Regression (PCR) (page 107): transform X to orthogonality
• Partial Least Squares (PLS) (page 114): find the best orthogonal linear combinations of X for predicting Y

Changes of scale:

Rescaling $x_i$ as $(x_i + a)/b$ leaves the t and F tests, $\hat\sigma^2$ and $R^2$ unchanged, and $\hat\beta_i \to b\hat\beta_i$.

Rescaling $y$ in the same way leaves the t and F tests and $R^2$ unchanged, but $\hat\sigma$ and $\hat\beta$ will be rescaled by $b$ (under this convention, divided by $b$).
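The effect of rescaling can be verified directly. The sketch uses the built-in `LifeCycleSavings` data as an assumed stand-in; the constants $b = 1000$ for the predictor and $b = 10$ for the response are arbitrary.

```r
d <- LifeCycleSavings
g  <- lm(sr ~ pop15 + pop75 + dpi + ddpi, data = d)

# Rescale one predictor: dpi -> (dpi + 0) / 1000, i.e. a = 0, b = 1000
gx <- lm(sr ~ pop15 + pop75 + I(dpi / 1000) + ddpi, data = d)

# Rescale the response the same way: sr -> (sr + 0) / 10, i.e. a = 0, b = 10
gy <- lm(I(sr / 10) ~ pop15 + pop75 + dpi + ddpi, data = d)

# t statistics of the three fits, for comparison
t_g  <- summary(g)$coefficients[, "t value"]
t_gx <- summary(gx)$coefficients[, "t value"]
t_gy <- summary(gy)$coefficients[, "t value"]

# Coefficient of the rescaled predictor relative to the original one
ratio_dpi <- unname(coef(gx)[4] / coef(g)[4])
```

The t statistics agree across the three fits, `ratio_dpi` equals $b = 1000$, and `summary(gy)$sigma` is one tenth of `summary(g)$sigma`.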

PCR and PLS compared

PCR:
• PCR attempts to find linear combinations of the predictors that explain most of the variation in these predictors using just a few components.
• The purpose is dimension reduction.
• Because the principal components can be linear combinations of all the predictors, the number of variables used is not always reduced.
• Because the principal components are selected using only the X-matrix and not the response, there is no definite guarantee that the PCR will predict the response particularly well although this often happens.

• If it happens that we can interpret the principal components in a meaningful way, we may achieve a much simpler explanation of the response. Thus PCR is geared more towards explanation than prediction.

PLS:
• In contrast, PLS finds linear combinations of the predictors that best explain the response. It is most effective when there are large numbers of variables to be considered.
• If successful, the variability of prediction is substantially reduced.

• On the other hand, PLS is virtually useless for explanation purposes.
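A PCR fit can be sketched with base R alone. The example assumes the built-in `longley` data; keeping two components is an arbitrary choice for illustration, not a recommendation. PLS is not shown because it needs an add-on package.

```r
X <- as.matrix(longley[, -7])   # predictors; column 7 is the response Employed
y <- longley$Employed

# Principal components of the centered and scaled predictors
pc <- prcomp(X, center = TRUE, scale. = TRUE)
var_explained <- pc$sdev^2 / sum(pc$sdev^2)

# Regress the response on the first two component scores (two is an arbitrary choice)
scores <- pc$x[, 1:2]
g_pcr <- lm(y ~ scores)

# The scores are mutually orthogonal, so their cross-product matrix is diagonal
cp <- crossprod(pc$x)
off_diag <- max(abs(cp[upper.tri(cp)]))
```

`var_explained` shows how much of the predictor variation each component carries, which is the quantity PCR uses to choose components; the response plays no part in that choice.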

Collinearity (page 117)

If $X^TX$ is singular, i.e. some predictors are linear combinations of others, we have (exact) collinearity and there is no unique least squares estimate of $\beta$. If $X^TX$ is close to singular, we have (approximate) collinearity or multicollinearity (some just call it collinearity). This causes serious problems with the estimation of $\beta$ and associated quantities as well as the interpretation.

Collinearity can be detected in several ways:
• Examination of the correlation matrix of the predictors will reveal large pairwise collinearities.
• A regression of $x_i$ on all other predictors gives $R_i^2$. Repeat for all predictors. $R_i^2$ close to one indicates a problem.

• If $R_i^2$ is close to one then the variance inflation factor (VIF) $1/(1-R_i^2)$ will be very large, and thus $\mathrm{var}(\hat\beta_i)$ will be large.
• Examine the eigenvalues of $X^TX$: small eigenvalues indicate a problem. The condition number is defined as

$$\kappa = \sqrt{\lambda_1 / \lambda_p}$$

where $\lambda_1 \ge \dots \ge \lambda_p$ are the eigenvalues of $X^TX$ and $\kappa > 30$ is considered large. Other condition numbers, $\sqrt{\lambda_1/\lambda_i}$, are also worth considering because they indicate whether more than just one independent linear combination is to blame.
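These diagnostics can be computed by hand. The sketch assumes the built-in `longley` data frame (with `Employed` as the response) corresponds to the Longley example; the shortcut of reading all VIFs off the inverse correlation matrix is a standard identity, not something taken from the slides.

```r
X <- as.matrix(longley[, -7])   # predictors; column 7 is the response Employed

# Pairwise correlations between predictors
R <- cor(X)

# VIF for one predictor from its regression on the other predictors
r2_gnp <- summary(lm(GNP ~ . - Employed, data = longley))$r.squared
vif_gnp <- 1 / (1 - r2_gnp)

# All VIFs at once: the diagonal of the inverse correlation matrix
vif_all <- diag(solve(R))

# Eigenvalues of X^T X and the condition numbers sqrt(lambda_1 / lambda_i)
ev <- eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values
cond <- sqrt(ev[1] / ev)
kappa_max <- cond[length(cond)]
```

Several entries of `cond` above 30 would indicate that more than one near-linear dependency is present.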

Collinearity leads to
• imprecise estimates of $\beta$: the signs of the coefficients may be misleading
• t-tests which fail to reveal significant factors
• missing importance of predictors

Example – Longley (pages 118-120)

Ridge Regression

Ridge regression makes the assumption that the regression coefficients (after normalization) are not likely to be very large.

It is appropriate for use when the design matrix is collinear and the usual least squares estimates of $\beta$ appear to be unstable.

The ridge regression estimates of $\beta$ are then given by

$$\hat\beta = (X^TX + \lambda I)^{-1} X^T y$$

A suitable $\lambda$ is chosen from the ridge trace plot.

Example – Longley (page 121)
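A ridge trace can be drawn with `lm.ridge()` from the MASS package. This is a sketch assuming the built-in `longley` data; the grid of $\lambda$ values from 0 to 0.1 is an assumed range chosen for illustration.

```r
library(MASS)  # provides lm.ridge()

lambdas <- seq(0, 0.1, by = 0.001)   # grid of ridge parameters (an assumed range)
gr <- lm.ridge(Employed ~ ., data = longley, lambda = lambdas)

# Ridge trace: one curve per (scaled) coefficient as lambda increases
matplot(gr$lambda, t(gr$coef), type = "l",
        xlab = expression(lambda), ylab = "Scaled coefficient")

# Squared length of the scaled coefficient vector at each lambda
coef_norm <- colSums(gr$coef^2)
```

The coefficient vector shrinks as $\lambda$ grows; a value of $\lambda$ is usually picked where the traces have settled down.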

Ridge regression estimates of coefficients are biased. Bias is undesirable but there are other considerations.

MSE decomposition:

$$E(\hat\beta - \beta)^2 = \left(E(\hat\beta - \beta)\right)^2 + E(\hat\beta - E\hat\beta)^2 = \text{bias}^2 + \text{variance}$$

Sometimes a large reduction in the variance may be obtained at the price of an increase in the bias. If the MSE is reduced as a consequence then we may be willing to accept some bias. This is the trade-off that ridge regression makes: a reduction in variance at the price of an increase in bias.

This is a common dilemma.

Chapter 10 Variable Selection

Select the best subset of predictors.

Prior to variable selection:
• Identify outliers and influential points; maybe exclude them, at least temporarily.
• Add in any transformations of the variables that seem appropriate.

Hierarchical Models

When selecting variables, it is important to respect the hierarchy. Lower order terms should not be removed from the model before higher order terms in the same variable.

There are two common situations where this arises:

• Polynomial models
• Models with interactions

Stepwise Procedures
• Backward Elimination
• Forward Selection
• Stepwise Regression
Example – statedata (page 126)

Criterion-based Procedures

The Akaike Information Criterion (AIC):

$$\mathrm{AIC} = -2\log(\text{likelihood}) + 2p$$

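and the Bayes Information Criterion (BIC):

$$\mathrm{BIC} = -2\log(\text{likelihood}) + p\log n$$

For linear regression models, $-2\log(\text{likelihood}) = n\log(\mathrm{RSS}/n)$ up to an additive constant.

Example – statedata

Adjusted $R^2$, written $R_a^2$. Recall that $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$. Then

$$R_a^2 = 1 - \frac{\mathrm{RSS}/(n-p)}{\mathrm{TSS}/(n-1)} = 1 - \left(\frac{n-1}{n-p}\right)(1 - R^2) = 1 - \frac{\hat\sigma^2_{\text{model}}}{\hat\sigma^2_{\text{null}}}$$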
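The criteria can be checked numerically against R's own output. The sketch assumes that the built-in `state.x77` matrix is the source of the statedata example; `step()` performs an AIC-based search, which is one possible stepwise procedure and not necessarily the one used on the slides.

```r
# state.x77 ships with R; data.frame() turns "Life Exp" into Life.Exp, etc.
statedata <- data.frame(state.x77, row.names = state.abb)
g <- lm(Life.Exp ~ ., data = statedata)

n <- nrow(statedata)
p <- length(coef(g))
rss <- deviance(g)

# AIC and BIC up to the same additive constant, using n log(RSS/n)
aic_hand <- n * log(rss / n) + 2 * p
bic_hand <- n * log(rss / n) + p * log(n)

# Adjusted R^2 from the formula, to compare with summary()
r2 <- summary(g)$r.squared
adj_hand <- 1 - (n - 1) / (n - p) * (1 - r2)

# AIC-based stepwise search starting from the full model
g_step <- step(g, trace = 0)
```

`extractAIC(g)` reports the same AIC value as `aic_hand`, and `formula(g_step)` shows which predictors the search retained.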

Predicted Residual Sum of Squares (PRESS)

$$\mathrm{PRESS} = \sum_i \hat\varepsilon_{(i)}^2$$

where the $\hat\varepsilon_{(i)}$ are the residuals calculated without using case $i$ in the fit.

Mallows' Cp Statistic

$$C_p = \frac{\mathrm{RSS}_p}{\hat\sigma^2} + 2p - n$$

where $\hat\sigma^2$ is from the model with all predictors and $\mathrm{RSS}_p$ indicates the RSS from a model with $p$ parameters.

Example – statedata (page 130)
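Both statistics are easy to compute by hand. In the sketch below the data come from the built-in `state.x77` matrix (assumed to match statedata), and the submodel is chosen only for illustration. PRESS is computed twice: with the leverage shortcut and by refitting without each case.

```r
statedata <- data.frame(state.x77, row.names = state.abb)
g_full <- lm(Life.Exp ~ ., data = statedata)
# A candidate submodel (this particular subset is chosen only for illustration)
g_sub <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = statedata)

# PRESS via the leverage shortcut: deleted residual = e_i / (1 - h_i)
press <- sum((residuals(g_sub) / (1 - hatvalues(g_sub)))^2)

# PRESS by brute force: refit without case i and predict case i
press_loo <- sum(sapply(seq_len(nrow(statedata)), function(i) {
  fit <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost,
            data = statedata[-i, ])
  (statedata$Life.Exp[i] - predict(fit, statedata[i, ]))^2
}))

# Mallows' Cp with sigma^2 estimated from the full model
n <- nrow(statedata)
p <- length(coef(g_sub))
sigma2_full <- summary(g_full)$sigma^2
cp <- deviance(g_sub) / sigma2_full + 2 * p - n

# For the full model itself, Cp equals its number of parameters
p_full <- length(coef(g_full))
cp_full <- deviance(g_full) / sigma2_full + 2 * p_full - n
```

A submodel with `cp` close to or below `p` is usually regarded as fitting adequately.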

Chapter 11 Statistical Strategy and Model Uncertainty

Strategy

Procedure: Diagnostics → Transformation → Variable Selection → Diagnostics

There is a danger of doing too much analysis. The more transformations and permutations of leaving out influential points you do, the better fitting model you will find. Torture the data long enough, and sooner or later it will confess.

Fitting the data well is no guarantee of good predictive performance or that the model is a good representation of the underlying population.

You should:
1. Avoid complex models for small datasets.
2. Try to obtain new data to validate your proposed model. Some people set aside some of their existing data for this purpose.
3. Use past experience with similar data to guide the choice of model.

Model multiplicity: The same data may support different models. Conclusions drawn from the models may differ quantitatively and qualitatively.

For more, see the author's experiment and discussion (pages 135-137).

Chapter 12 A Complete Example

Chapter 13 Robust and Resistant Regression

• Errors ~ Normal: use LSE (least squares estimation).
• Errors ~ long-tailed:
  – LSE with outliers removed through outlier tests
  – Least Trimmed Squares (LTS): resistant regression; ltsreg() in library lqs (see the Chicago example on page 152)
  – Robust regression, e.g. the M-estimate, which chooses $\beta$ to minimize $\sum_i \rho(y_i - x_i^T\beta)$

Possible choices of $\rho$:

• $\rho(x) = x^2$: least squares regression
• $\rho(x) = |x|$: least absolute deviations regression (LAD)
• Huber's method: a compromise between LS and LAD
• rlm() in the MASS library
• Example – Chicago insurance data (page 151)
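The three kinds of fit can be compared side by side. The Chicago insurance data are not part of base R, so this sketch uses the built-in `stackloss` data as a stand-in; in current R releases `ltsreg()` is provided by MASS, and the seed is arbitrary.

```r
library(MASS)  # rlm() and ltsreg(); the old lqs library is now part of MASS

# stackloss ships with R and is used here only as a stand-in data set
g_ls  <- lm(stack.loss ~ ., data = stackloss)    # least squares
g_hub <- rlm(stack.loss ~ ., data = stackloss)   # Huber M-estimate (rlm default)

set.seed(1)  # ltsreg() uses random resampling, so fix the seed
g_lts <- ltsreg(stack.loss ~ ., data = stackloss)

# Final IRLS weights: cases with weight below 1 were downweighted by Huber's method
w <- g_hub$w

# Coefficients of the three fits side by side
coefs <- cbind(LS = coef(g_ls), Huber = coef(g_hub), LTS = coef(g_lts))
```

Cases with small weights in `w` are the ones the robust fit treats as unusual; large differences between the columns of `coefs` suggest that a few points are driving the least squares fit.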

Chapter 14 Missing Data

Chapter 15 Analysis of Covariance

An example: a regression problem with a mixture of quantitative and qualitative predictors
• Purpose: to see the effect of a medication on cholesterol level
• Two groups:
  1) Treatment: receive the medication, age 50 ~ 70
  2) Control: do not receive the medication, age 30 ~ 50

Figure 15 shows that the mean reduction in cholesterol was 0% and 10% respectively. Better not to be treated?

This is not a two-sample problem, as the two groups differ with respect to age.

Taking age into account (at age 50), the difference between the two groups is again 10%, but in favor of the treatment!

Analysis of covariance: adjust the groups for the difference in age (a covariate) and examine the effect of the medication.

Method: code the qualitative predictor(s) and incorporate them within the $y = X\beta + \varepsilon$ framework.
• y: change in cholesterol level
• x: age
• d: 0 (or -1) for the control group and 1 for the treatment group (see 15.2 for coding qualitative predictors, p. 164)

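Models:
• $y = \beta_0 + \beta_1 x + \varepsilon$ (in R: y ~ x)
• $y = \beta_0 + \beta_1 x + \beta_2 d + \varepsilon$ (in R: y ~ x + d)
• $y = \beta_0 + \beta_1 x + \beta_2 d + \beta_3 x \cdot d + \varepsilon$ (in R: y ~ x * d)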
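The three models can be fitted and compared on simulated data. Everything below is hypothetical: the sample size, coefficients and noise level are invented for illustration, and only the age ranges follow the cholesterol example.

```r
set.seed(1)  # arbitrary seed; all data below are simulated, not the slide's data
n <- 40
d <- rep(c(0, 1), each = n / 2)              # 0 = control, 1 = treatment
x <- ifelse(d == 1, runif(n, 50, 70), runif(n, 30, 50))   # age ranges as on the slide
y <- 5 + 0.3 * x - 10 * d + rnorm(n, sd = 2) # hypothetical coefficients

g1 <- lm(y ~ x)        # single line
g2 <- lm(y ~ x + d)    # parallel lines: d shifts the intercept
g3 <- lm(y ~ x * d)    # separate lines: d shifts intercept and slope

# x * d expands to x + d + x:d
terms_g3 <- attr(terms(g3), "term.labels")

# Nested F-tests: is d needed? is the interaction needed?
tab <- anova(g1, g2, g3)
```

Each row of `tab` after the first tests whether the extra term in that model improves on the previous one.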

A two-level example (page 161): English medieval cathedrals

Variables
• X: nave height
• Y: total length
• Style: r = Romanesque, g = Gothic

Note that some cathedrals have parts in both styles.

Purpose: to see how the length is related to height for the two styles

Coding qualitative predictors

For a k-level predictor, k-1 dummy variables are needed for the representation. One parameter is used to represent the overall mean effect, or perhaps the mean of some reference level, and so only k-1 variables are needed rather than k.

Treatment coding

A 4-level factor will be coded using 3 dummy variables:

          Dummy coding
Level     1   2   3
  1       0   0   0
  2       1   0   0
  3       0   1   0
  4       0   0   1

• The first level is the reference level to which the other levels are compared.
• R assigns levels to a factor in alphabetical order by default.
• The columns are orthogonal and the corresponding dummies will be too. The dummies won't be orthogonal to the intercept.
• Treatment coding is the default choice for R.

Helmert Coding (page 165)
• Helmert coding is less convenient for interpretation. It is the default choice in S-PLUS.

The choice of coding does not affect the $R^2$, $\hat\sigma^2$ or the overall F-statistic. It does affect $\hat\beta$, and you do need to know what the coding is before making conclusions about $\hat\beta$.
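The two codings and their effect on a fit can be inspected directly. The data below are simulated and purely hypothetical; only `contr.treatment()` and `contr.helmert()` are standard R functions.

```r
# Coding matrices for a 4-level factor
ct <- contr.treatment(4)   # treatment coding: level 1 is the reference
ch <- contr.helmert(4)     # Helmert coding

# Simulated data (hypothetical), only to compare fits under the two codings
set.seed(1)
f <- factor(rep(c("a", "b", "c", "d"), each = 5))
y <- rnorm(20, mean = as.integer(f))

g_treat <- lm(y ~ f, contrasts = list(f = "contr.treatment"))
g_helm  <- lm(y ~ f, contrasts = list(f = "contr.helmert"))

# Same fit, different parameterization
r2 <- c(treatment = summary(g_treat)$r.squared,
        helmert   = summary(g_helm)$r.squared)
```

The fitted values and $R^2$ agree between the two fits, while `coef(g_treat)` and `coef(g_helm)` differ because they answer different questions about the levels.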

A three-level example (page 165)

The data:
• IQ scores for identical twins, one raised by foster parents, the other by the natural parents.
• Social class of the natural parents (high, middle or low).

Purpose: predict the IQ of the twin with foster parents from the IQ of the twin with the natural parents and the social class of the natural parents.

Separate lines model:
> g <- lm(Foster ~ Biological*Social, twins)
> summary(g)

• The reference level is high class, being first alphabetically.
• We see that the intercept for the low class line would be -1.872 + 9.0767, while the slope for the middle class line would be 0.9776 - 0.005.

Can the model be simplified to the parallel lines model?
> gr <- lm(Foster ~ Biological+Social, twins)
> anova(gr,g)

Further reduction to a single line model:
> gr <- lm(Foster ~ Biological, twins)

Conclusion: the single line model is accepted (p-value = 0.59423, so the null is not rejected).

Chapter 16 ANOVA

Introduction
• Predictors are now all categorical/qualitative.
• ANOVA is used to partition the overall variance in the response into that due to each of the factors and the error.
• Predictors are now typically called factors, which have some number of levels.
• The parameters are now often called effects.
• Two kinds of models:
  – fixed-effects models: parameters are considered fixed
  – random-effects models: parameters are random

One-Way Anova
• The model: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$
• Estimation and testing
  – Estimate the effects.
  – Test if there is a difference in the levels of the factor.
• Example (page 169)
  – Response: blood coagulation times
  – Factor: diets
  – Sample size: 24 (animals)
• Diagnostics (page 171)

Multiple Comparisons (page 172)
• Bonferroni (good for a few comparisons, but it becomes conservative if there are many comparisons)
• Fisher's LSD (after the overall F-test shows a difference)
• Tukey's Honest Significant Difference (HSD) (for all pairwise comparisons)

Contrasts

Scheffé's theorem for multiple comparisons

Testing for homogeneity of variance: Levene test
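A one-way analysis with multiple comparisons can be sketched in base R. The coagulation data are not part of base R, so the built-in `PlantGrowth` data serve as a stand-in here; the Levene-type test is done by hand as a one-way ANOVA on the absolute residuals.

```r
# PlantGrowth ships with R: plant weight for 3 groups of 10 (a stand-in data set)
g <- lm(weight ~ group, data = PlantGrowth)

# Overall F-test for a difference between the levels
p_overall <- anova(g)[1, "Pr(>F)"]

# Tukey's HSD for all pairwise comparisons (needs an aov fit)
tk <- TukeyHSD(aov(weight ~ group, data = PlantGrowth))

# Bonferroni-adjusted pairwise t-tests
pb <- pairwise.t.test(PlantGrowth$weight, PlantGrowth$group,
                      p.adjust.method = "bonferroni")

# Levene-type test: one-way ANOVA on the absolute residuals
p_levene <- anova(lm(abs(residuals(g)) ~ group, data = PlantGrowth))[1, "Pr(>F)"]
```

`tk$group` holds one row per pair of levels with the estimated difference, its interval and an adjusted p-value; `pb$p.value` gives the Bonferroni-adjusted counterpart.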

Two-Way Anova
• One observation per cell
• More than one observation per cell
• Interpreting the interaction effect

Example – rats (pages 181-184)
• 48 rats were allocated to 3 poisons (I, II, III) and 4 treatments (A, B, C, D).
• The response was survival time in tens of hours.
• 4 replicates in each cell
• The reference level is I for poison and A for treatment.
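A replicated two-way layout can be analysed along the same lines. The rats data are not part of base R, so this sketch uses the built-in `warpbreaks` data (2 wool types by 3 tension levels) as a stand-in.

```r
# warpbreaks ships with R: 2 wool types x 3 tension levels, 9 replicates per cell
g <- lm(breaks ~ wool * tension, data = warpbreaks)

# Sequential ANOVA table: main effects, then the interaction
tab <- anova(g)

# Replicates per cell
cell_n <- table(warpbreaks$wool, warpbreaks$tension)

# Interaction plot: roughly parallel profiles suggest little interaction
with(warpbreaks, interaction.plot(tension, wool, breaks))

# Reference levels under treatment coding are the first levels of each factor
ref <- c(wool = levels(warpbreaks$wool)[1],
         tension = levels(warpbreaks$tension)[1])
```

The interaction row of `tab` tests whether the effect of one factor depends on the level of the other; the main effects are easiest to interpret when that interaction is negligible.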

Blocking designs (page 185)

Latin Squares (page 191)

Balanced Incomplete Block design (page 195)

Factorial experiments (page 200)

Afterword: Computing and Statistical Analysis Software

Symbolic computation software
• Maple v9.0
• Mathematica (v5.0)
• Symbolic toolbox for Matlab (v7.0)
• Scientific workplace (SWP)
• Maxima (from Macsyma)
• MuPAD
• Reduce
• ……

Numerical computing software
• Matlab
• Scilab

Statistical analysis software
• SPSS (v12)
• SAS (v9.1)
• BMDP (v7)
• Statistica (v6)
• Stata (v8)
• Gauss
• S-plus (v6)
• R (v9)
• Minitab (v14)
• Systat (v10)
• JMP (v5)
• MacAnova

Bayesian statistical analysis
• WinBUGS (v1.4) (including ARS)
• CODA (v0.5-1 for R)
• JAGS (v0.50)

Econometric data analysis software
• Tsp (v6.5)
• Eviews (v4)
• Istm2000
• TSA
• OX/OXMatrix packages: PcGets, PcGive, Tsp, ARFIMA, GARCH, MSVAR

A comprehensive comparison of data processing software

Comparison of mathematical programs for data analysis (Edition 4.4)
by Stefan Steinhaus
http://www.scientificweb.com/ncrunch/