1
Part IB. Descriptive Statistics
Multivariate Statistics
Focus: Multiple regression (Spring 2007)
2
Regression Analysis
• Y = f(X): Y is a function of X
• Regression analysis: a method of determining the specific function relating Y to X
• Linear regression: a popular model in social science
• A brief review is offered here
  – See the ppt files on the course website
3
Example: summarize the relationship with a straight line
4
Draw a straight line, but how?
5
Notice that some predictions are not completely accurate.
6
How to draw the line?
• Purpose: draw the regression line that gives the most accurate predictions of y given x
• Criterion for "accurate": minimize
  sum of (observed y – predicted y)² = sum of (prediction errors)²
  This quantity is called the sum of squared errors, or sum of the squared residuals (SSE)
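The SSE criterion can be sketched in a few lines of Python (the data and the two candidate lines here are made up for illustration):

```python
# Score a candidate line y_hat = a + b*x by its sum of squared errors (SSE).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def sse(a, b):
    """Sum of (observed y - predicted y)^2 for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# For these points the least-squares line turns out to be a = 2.2, b = 0.6;
# any other line has a larger SSE.
print(sse(2.2, 0.6))  # SSE of the least-squares line, approximately 2.4
print(sse(0.0, 1.5))  # SSE of an arbitrary alternative line, approximately 11.75
```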
7
Ordinary Least Squares (OLS) Regression
• The regression line is drawn so as to minimize the sum of the squared vertical distances from the points to the line (i.e., so that SSE is as small as possible)
• This line minimizes squared prediction error
• This line passes through the middle of the point cloud (think of it as a natural choice for describing the relationship)
8
To describe a regression line (equation):
• Algebraically, a line is described by its intercept and slope
• Notation:
  y = the dependent variable
  x = the independent variable
  y_hat = predicted y, based on the regression line
  β = slope of the regression line
  α = intercept of the regression line
9
The meaning of slope and intercept:
• slope = change in y_hat for a 1-unit change in x
• intercept = value of y_hat when x is 0
• When interpreting the intercept and slope, pay attention to the units of x and y
10
General equation of a regression line: y_hat = α + βx
where α and β are chosen to minimize:
sum of (observed y – predicted y)²
The formulas for α and β that minimize this sum are programmed into statistical packages and calculators
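The closed-form solution that those packages implement can be written out directly; a minimal sketch with made-up data:

```python
# Closed-form OLS estimates:
#   b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   a = y_bar - b * x_bar
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar  # the fitted line always passes through (x_bar, y_bar)

print(a, b)  # approximately 2.2 and 0.6 for these points
```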
11
An example of a regression line
12
Fit: how much of the variation in y can the regression explain?
• Look at the regression equation again:
  y_hat = α + βx
  y = α + βx + ε
• Data = what we explain + what we don't explain
• Data = predicted + residual
13
In regression, we can think of "fit" in this way:
• Total variation = sum of squares of y about its mean
• Explained variation = the part of the total variation explained by our predictions
• Unexplained variation = sum of squares of the residuals
• R² = (explained variation) / (total variation), the coefficient of determination
  [the share of the total variation in y that the regression can explain]
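This decomposition is easy to verify numerically; a minimal sketch with made-up data and its least-squares line:

```python
# Total, unexplained, and explained variation, and R^2.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # least-squares intercept and slope for these points

y_bar = sum(ys) / len(ys)
preds = [a + b * x for x in xs]

total = sum((y - y_bar) ** 2 for y in ys)                   # total variation
unexplained = sum((y - p) ** 2 for y, p in zip(ys, preds))  # sum of squared residuals
r_squared = 1 - unexplained / total                         # explained / total

print(r_squared)  # approximately 0.6: the regression explains 60% of the variation in y
```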
14
R² = r²
NOTE: this is a special feature of simple (OLS) regression; it does not hold for multiple regression or other regression methods.
15
Some cautions about regression and R²
• It is dangerous to use R² to judge how "good" a regression is
  – The "appropriateness" of regression is not a function of R²
• When to use regression?
  – Not suitable for non-linear shapes [though you can transform non-linear shapes]
  – Regression is appropriate when r (correlation) is appropriate as a measure
16
Supplement: Proportional Reduction of Error (PRE)
• PRE measures compare the errors of predictions under different prediction rules; they contrast a naive rule with a sophisticated rule
• R² is a PRE measure
• Naive rule = predict y_bar
• Sophisticated rule = predict y_hat
• R² measures the reduction in predictive error from using the regression predictions rather than predicting the mean of y
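The PRE logic can be made concrete (same made-up data as earlier sketches; the "naive" and "sophisticated" rules follow the definitions above):

```python
# PRE = (naive prediction error - model prediction error) / naive prediction error
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # least-squares line for these points

y_bar = sum(ys) / len(ys)
naive_error = sum((y - y_bar) ** 2 for y in ys)                    # predict y_bar
model_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # predict y_hat

pre = (naive_error - model_error) / naive_error
print(pre)  # approximately 0.6, identical to R^2 for an OLS regression
```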
17
Cautions about correlation and regression:
• Extrapolation is not appropriate
• Regression: pay attention to lurking or omitted variables
  – Lurking (omitted) variables: variables that influence the relationship between two variables but are not included among the variables studied
  – A problem in establishing causation
• Association does not imply causation
  – Association alone is weak evidence about causation
  – Experiments with random assignment are the best way to establish causation
18
Inference for Simple Regression
19
Regression Equation
Equation of a regression line:
  y_hat = α + βx
  y = α + βx + ε
y = dependent variable
x = independent variable
β = slope = predicted change in y with a one-unit change in x
α = intercept = predicted value of y when x is 0
y_hat = predicted value of the dependent variable
20
Global test: the F test examines whether the regression equation has any explanatory power (H0: β = 0)
21
22
The regression model ( 迴歸模型 )
• Note: the slope and intercept of the fitted regression line are statistics (i.e., computed from the sample data).
• To do inference, we treat the sample slope and intercept as estimates of the unknown population parameters β and α.
23
Inference for regression
• Population regression line:
  μ_y = α + βx
• Estimated from the sample:
  y_hat = a + bx
• b is an unbiased estimator of the true slope β, and a is an unbiased estimator of the true intercept α
24
Sampling distribution of a (intercept) and b (slope)
• Mean of the sampling distribution of a is α
• Mean of the sampling distribution of b is β
• The standard errors of a and b are related to the amount of spread about the regression line (σ)
• The sampling distributions are normal; with σ estimated, use the t distribution for inference
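Unbiasedness can be checked by simulation: generate many samples from a known population line and average the fitted slopes. (The population values below are arbitrary choices for the demonstration.)

```python
import random

random.seed(0)
ALPHA, BETA, SIGMA, N = 2.0, 0.5, 1.0, 30  # assumed "true" population values

def ols_slope(xs, ys):
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

# Draw 2000 samples of size N from y = ALPHA + BETA*x + noise and fit each one.
slopes = []
for _ in range(2000):
    xs = [random.uniform(0, 10) for _ in range(N)]
    ys = [ALPHA + BETA * x + random.gauss(0, SIGMA) for x in xs]
    slopes.append(ols_slope(xs, ys))

mean_b = sum(slopes) / len(slopes)
print(mean_b)  # close to BETA = 0.5: b is centered on the true slope
```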
26
The standard error of the least-squares line
• Estimate σ (the spread about the regression line) using the residuals from the regression
• Recall that residual = (y – y_hat)
• Estimate the population standard deviation about the regression line (σ) using the sample estimates
27
Estimate σ from sample data
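The usual estimate, with n − 2 in the denominator because two parameters (a and b) were estimated, is s = sqrt( Σ(y − y_hat)² / (n − 2) ). A sketch with made-up data:

```python
import math

# Residual standard error: s = sqrt( SSE / (n - 2) )
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # least-squares line for these points

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (len(xs) - 2))  # divide by n - 2, not n

print(s)  # approximately 0.894
```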
28
Standard Error of Slope (b)
• The sampling distribution of the slope b has a standard error, SEb
• A small standard error of b means our estimate b is a precise estimate of β
• SEb is directly related to s, and inversely related to the sample size (n) and to Sx (the spread of x)
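Those relationships follow from the formula SE_b = s / sqrt( Σ(x − x_bar)² ); a sketch continuing the made-up example (the value of s is illustrative):

```python
import math

xs = [1, 2, 3, 4, 5]
s = 0.8944  # residual standard error (illustrative value)

x_bar = sum(xs) / len(xs)
sxx = sum((x - x_bar) ** 2 for x in xs)  # grows with n and with the spread of x
se_b = s / math.sqrt(sxx)                # so SE_b shrinks as either grows

print(se_b)  # approximately 0.283
```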
29
Confidence Interval for regression slope
A level C confidence interval for the slope β of the "true" regression line is
  b ± t* × SEb
where t* is the upper (1 – C)/2 critical value from the t distribution with n – 2 degrees of freedom.
To test the hypothesis H0: β = 0, compute the t statistic:
  t = b / SEb
which, under H0, follows the t distribution with n – 2 degrees of freedom.
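A numerical sketch (the slope, standard error, and sample size below are illustrative; t* is looked up from a t table rather than computed):

```python
# 95% CI and t test for a slope with n = 5 observations (df = n - 2 = 3).
b, se_b = 0.6, 0.283  # illustrative estimates
t_star = 3.182        # two-sided 95% critical value of t with 3 df

ci_low = b - t_star * se_b
ci_high = b + t_star * se_b
t_stat = b / se_b     # test statistic for H0: beta = 0

print((ci_low, ci_high), t_stat)
# The interval contains 0 and |t| < 3.182, so with only 5 points
# we cannot reject H0 at the 5% level.
```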
30
Significance Tests for the slope
Test hypotheses about the slope β. Usually:
H0: β= 0 (no linear relationship between the independent and dependent variable)
Alternatives:
HA: β > 0 or HA: β < 0
or HA: β ≠ 0
31
32
Statistical inference for intercept
We could also do statistical inference for the regression intercept, α
Possible hypotheses:
  H0: α = 0
  HA: α ≠ 0
The t test is based on a, and is very similar to the prior t tests we have done.
For most substantive applications, we are interested in the slope (β), not usually in α.
33
Example: SPSS Regression Procedures and Output
• To get a scatterplot:
  Graphs → Scatter → Simple → Define (choose x and y)
• To get a correlation coefficient:
  Analyze → Correlate → Bivariate
• To perform a simple regression:
  Analyze → Regression → Linear (choose x and y; you can also save the predicted values and residuals)
34
SPSS Example: Infant mortality vs. Female Literacy, 1995 UN Data
[Scatterplot: Infant mortality (deaths per 1,000 live births) vs. Females who read (%); 109 countries, 1995 UN Data]
35
Example: correlation between infant mortality and female literacy
Correlations

                              BABYMORT      LIT_FEMA
BABYMORT  Infant mortality (deaths per 1,000 live births)
          Pearson correlation 1             -.843**
          Sig. (2-tailed)     .             .000
          N                   109           85
LIT_FEMA  Females who read (%)
          Pearson correlation -.843**       1
          Sig. (2-tailed)     .000          .
          N                   85            85

**. Correlation is significant at the 0.01 level (2-tailed).
36
Regression: infant mortality vs. female literacy, 1995 UN Data
Model Summary(b)

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .843a  .711       .708                20.6971

a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent variable: BABYMORT Infant mortality (deaths per 1,000 live births)

Coefficients(a)

                      Unstandardized        Standardized
Model 1               B         Std. Error  Beta     t        Sig.   95% CI for B
(Constant)            127.203   5.764                22.067   .000   [115.738, 138.668]
LIT_FEMA              -1.129    .079        -.843    -14.302  .000   [-1.286, -.972]

a. Dependent variable: BABYMORT Infant mortality (deaths per 1,000 live births)
37
Regression: infant mortality vs. female literacy, 1995 UN Data
ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   87617.840        1    87617.840     204.538   .000a
Residual     35554.673        83   428.370
Total        123172.513       84

a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent variable: BABYMORT Infant mortality (deaths per 1,000 live births)
38
Hypothesis test example
Dahua is studying generational differences in educational attainment. He has collected data on 117 father-son pairs: the father's education level is the independent variable and the son's education level is the dependent variable. His estimated regression equation is: y_hat = 0.2915x + 10.25
The standard error of the regression slope is 0.10.
1. At α = 0.05, can Dahua conclude that fathers' and sons' education levels are related?
2. Among all boys whose fathers are college graduates, what is the predicted mean education level of these boys?
3. One boy's father is a college graduate; what do we predict the boy's future education level to be?
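A worked sketch of the answers (assuming, since the problem does not say, that a college degree corresponds to 16 years of schooling):

```python
a, b, se_b, n = 10.25, 0.2915, 0.10, 117

# Q1: t test of H0: beta = 0 with df = n - 2 = 115;
# the two-sided 5% critical value of t with 115 df is about 1.98.
t_stat = b / se_b
print(t_stat)  # 2.915 > 1.98, so reject H0: the education levels are related

# Q2 and Q3: the point prediction for a father with 16 years of education.
y_hat = a + b * 16
print(y_hat)  # approximately 14.91 years
```

Q2 and Q3 share the same point prediction; they differ only in the interval around it (a confidence interval for the mean response vs. a wider prediction interval for an individual).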