1
Part IB. Descriptive Statistics
Multivariate Statistics
Focus: Multiple regression (Spring 2007)
2
Regression Analysis
• Y = f(X): Y is a function of X
• Regression analysis: a method of determining the specific function relating Y to X
• Linear regression: a popular model in social science
• A brief review is offered here
  – See the ppt files on the course website
3
Example: summarize the relationship with a straight line
4
Draw a straight line, but how?
5
Notice that some predictions are not completely accurate.
6
How to draw the line?
• Purpose: draw the regression line that gives the most accurate predictions of y given x
• Criterion for "accurate": minimize
  sum of (observed y – predicted y)² = sum of (prediction errors)²
  This quantity is called the sum of squared errors, or sum of the squared residuals (SSE)
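The SSE criterion can be sketched in a few lines of Python (the data and the two candidate lines here are made up for illustration):

```python
# Score a candidate line y_hat = a + b*x by its sum of squared errors (SSE).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

def sse(a, b):
    """Sum of (observed y - predicted y)^2 for the line y_hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# For these points the least-squares line turns out to be a = 2.2, b = 0.6;
# any other line has a larger SSE.
print(sse(2.2, 0.6))  # SSE of the least-squares line, approximately 2.4
print(sse(0.0, 1.5))  # SSE of an arbitrary alternative line, approximately 11.75
```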
7
Ordinary Least Squares (OLS) Regression
• The regression line is drawn so as to minimize the sum of the squared vertical distances from the points to the line (i.e., so that SSE is as small as possible)
• This line minimizes squared prediction error
• This line passes through the middle of the point cloud (think of it as a natural choice for describing the relationship)
8
To describe a regression line (equation):
• Algebraically, a line is described by its intercept and slope
• Notation:
  y = the dependent variable
  x = the independent variable
  y_hat = predicted y, based on the regression line
  β = slope of the regression line
  α = intercept of the regression line
9
The meaning of slope and intercept:
• slope = change in y_hat for a 1-unit change in x
• intercept = value of y_hat when x is 0
• When interpreting the intercept and slope, pay attention to the units of x and y
10
General equation of a regression line: y_hat = α + βx
where α and β are chosen to minimize:
sum of (observed y – predicted y)²
The formulas for α and β that minimize this sum are programmed into statistical packages and calculators
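The closed-form solution that those packages implement can be written out directly; a minimal sketch with made-up data:

```python
# Closed-form OLS estimates:
#   b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   a = y_bar - b * x_bar
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
    sum((x - x_bar) ** 2 for x in xs)
a = y_bar - b * x_bar  # the fitted line always passes through (x_bar, y_bar)

print(a, b)  # approximately 2.2 and 0.6 for these points
```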
11
An example of a regression line
12
Fit: how much of the variation in y can the regression explain?
• Look at the regression equation again:
  y_hat = α + βx
  y = α + βx + ε
• Data = what we explain + what we don't explain
• Data = predicted + residual
13
In regression, we can think of "fit" in this way:
• Total variation = sum of squares of y about its mean
• Explained variation = the part of the total variation explained by our predictions
• Unexplained variation = sum of squares of the residuals
• R² = (explained variation) / (total variation), the coefficient of determination
  [the share of the total variation in y that the regression can explain]
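This decomposition is easy to verify numerically; a minimal sketch with made-up data and its least-squares line:

```python
# Total, unexplained, and explained variation, and R^2.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # least-squares intercept and slope for these points

y_bar = sum(ys) / len(ys)
preds = [a + b * x for x in xs]

total = sum((y - y_bar) ** 2 for y in ys)                   # total variation
unexplained = sum((y - p) ** 2 for y, p in zip(ys, preds))  # sum of squared residuals
r_squared = 1 - unexplained / total                         # explained / total

print(r_squared)  # approximately 0.6: the regression explains 60% of the variation in y
```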
14
R² = r²
NOTE: this is a special feature of simple (OLS) regression; it does not hold for multiple regression or other regression methods.
15
Some cautions about regression and R²
• It is dangerous to use R² to judge how "good" a regression is
  – The "appropriateness" of regression is not a function of R²
• When to use regression?
  – Not suitable for non-linear shapes [though you can transform non-linear shapes]
  – Regression is appropriate when r (correlation) is appropriate as a measure
16
Supplement: Proportional Reduction of Error (PRE)
• PRE measures compare the errors of predictions under different prediction rules; they contrast a naive rule with a sophisticated rule
• R² is a PRE measure
• Naive rule = predict y_bar
• Sophisticated rule = predict y_hat
• R² measures the reduction in predictive error from using the regression predictions rather than predicting the mean of y
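The PRE logic can be made concrete (same made-up data as earlier sketches; the "naive" and "sophisticated" rules follow the definitions above):

```python
# PRE = (naive prediction error - model prediction error) / naive prediction error
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # least-squares line for these points

y_bar = sum(ys) / len(ys)
naive_error = sum((y - y_bar) ** 2 for y in ys)                    # predict y_bar
model_error = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # predict y_hat

pre = (naive_error - model_error) / naive_error
print(pre)  # approximately 0.6, identical to R^2 for an OLS regression
```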
17
Cautions about correlation and regression:
• Extrapolation is not appropriate
• Regression: pay attention to lurking or omitted variables
  – Lurking (omitted) variables: variables that influence the relationship between two variables but are not included among the variables studied
  – A problem in establishing causation
• Association does not imply causation
  – Association alone is weak evidence about causation
  – Experiments with random assignment are the best way to establish causation
18
Inference for Simple Regression
19
Regression Equation
Equation of a regression line:
  y_hat = α + βx
  y = α + βx + ε
y = dependent variable
x = independent variable
β = slope = predicted change in y with a one-unit change in x
α = intercept = predicted value of y when x is 0
y_hat = predicted value of the dependent variable
20
Global test: the F test examines whether the regression equation has any explanatory power (H0: β = 0)
21
22
The regression model ( 迴歸模型 )
• Note: the slope and intercept of the fitted regression line are statistics (i.e., computed from the sample data).
• To do inference, we treat the sample slope and intercept as estimates of the unknown population parameters β and α.
23
Inference for regression
• Population regression line:
  μ_y = α + βx
• Estimated from the sample:
  y_hat = a + bx
• b is an unbiased estimator of the true slope β, and a is an unbiased estimator of the true intercept α
24
Sampling distribution of a (intercept) and b (slope)
• Mean of the sampling distribution of a is α
• Mean of the sampling distribution of b is β
• The standard errors of a and b are related to the amount of spread about the regression line (σ)
• The sampling distributions are normal; with σ estimated, use the t distribution for inference
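Unbiasedness can be checked by simulation: generate many samples from a known population line and average the fitted slopes. (The population values below are arbitrary choices for the demonstration.)

```python
import random

random.seed(0)
ALPHA, BETA, SIGMA, N = 2.0, 0.5, 1.0, 30  # assumed "true" population values

def ols_slope(xs, ys):
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
            / sum((x - x_bar) ** 2 for x in xs))

# Draw 2000 samples of size N from y = ALPHA + BETA*x + noise and fit each one.
slopes = []
for _ in range(2000):
    xs = [random.uniform(0, 10) for _ in range(N)]
    ys = [ALPHA + BETA * x + random.gauss(0, SIGMA) for x in xs]
    slopes.append(ols_slope(xs, ys))

mean_b = sum(slopes) / len(slopes)
print(mean_b)  # close to BETA = 0.5: b is centered on the true slope
```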
26
The standard error of the least-squares line
• Estimate σ (the spread about the regression line) using the residuals from the regression
• Recall that residual = (y – y_hat)
• Estimate the population standard deviation about the regression line (σ) using the sample estimates
27
Estimate σ from sample data
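The usual estimate, with n − 2 in the denominator because two parameters (a and b) were estimated, is s = sqrt( Σ(y − y_hat)² / (n − 2) ). A sketch with made-up data:

```python
import math

# Residual standard error: s = sqrt( SSE / (n - 2) )
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6  # least-squares line for these points

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (len(xs) - 2))  # divide by n - 2, not n

print(s)  # approximately 0.894
```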
28
Standard Error of Slope (b)
• The sampling distribution of the slope b has a standard error, SEb
• A small standard error of b means our estimate b is a precise estimate of β
• SEb is directly related to s, and inversely related to the sample size (n) and to Sx (the spread of x)
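Those relationships follow from the formula SE_b = s / sqrt( Σ(x − x_bar)² ); a sketch continuing the made-up example (the value of s is illustrative):

```python
import math

xs = [1, 2, 3, 4, 5]
s = 0.8944  # residual standard error (illustrative value)

x_bar = sum(xs) / len(xs)
sxx = sum((x - x_bar) ** 2 for x in xs)  # grows with n and with the spread of x
se_b = s / math.sqrt(sxx)                # so SE_b shrinks as either grows

print(se_b)  # approximately 0.283
```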
29
Confidence Interval for regression slope
A level C confidence interval for the slope β of the "true" regression line is
  b ± t* × SEb
where t* is the upper (1 – C)/2 critical value from the t distribution with n – 2 degrees of freedom.
To test the hypothesis H0: β = 0, compute the t statistic:
  t = b / SEb
which, under H0, follows the t distribution with n – 2 degrees of freedom.
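A numerical sketch (the slope, standard error, and sample size below are illustrative; t* is looked up from a t table rather than computed):

```python
# 95% CI and t test for a slope with n = 5 observations (df = n - 2 = 3).
b, se_b = 0.6, 0.283  # illustrative estimates
t_star = 3.182        # two-sided 95% critical value of t with 3 df

ci_low = b - t_star * se_b
ci_high = b + t_star * se_b
t_stat = b / se_b     # test statistic for H0: beta = 0

print((ci_low, ci_high), t_stat)
# The interval contains 0 and |t| < 3.182, so with only 5 points
# we cannot reject H0 at the 5% level.
```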
30
Significance Tests for the slope
Test hypotheses about the slope β. Usually:
H0: β= 0 (no linear relationship between the independent and dependent variable)
Alternatives:
HA: β > 0 or HA: β < 0
or HA: β ≠ 0
31
32
Statistical inference for intercept
We could also do statistical inference for the regression intercept, α
Possible hypotheses:
  H0: α = 0
  HA: α ≠ 0
The t test is based on a, and is very similar to the prior t tests we have done.
For most substantive applications, we are interested in the slope (β), not usually in α.
33
Example: SPSS Regression Procedures and Output
• To get a scatterplot:
  Graphs → Scatter → Simple → Define (choose x and y)
• To get a correlation coefficient:
  Analyze → Correlate → Bivariate
• To perform a simple regression:
  Analyze → Regression → Linear (choose x and y; you can also save the predicted values and residuals)
34
SPSS Example: Infant mortality vs. Female Literacy, 1995 UN Data
[Scatterplot: Infant mortality (deaths per 1,000 live births) vs. Females who read (%); 109 countries, 1995 UN Data]
35
Example: correlation between infant mortality and female literacy
Correlations

                              BABYMORT      LIT_FEMA
BABYMORT  Infant mortality (deaths per 1,000 live births)
          Pearson correlation 1             -.843**
          Sig. (2-tailed)     .             .000
          N                   109           85
LIT_FEMA  Females who read (%)
          Pearson correlation -.843**       1
          Sig. (2-tailed)     .000          .
          N                   85            85

**. Correlation is significant at the 0.01 level (2-tailed).
36
Regression: infant mortality vs. female literacy, 1995 UN Data
Model Summary(b)

Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .843a  .711       .708                20.6971

a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent variable: BABYMORT Infant mortality (deaths per 1,000 live births)

Coefficients(a)

                      Unstandardized        Standardized
Model 1               B         Std. Error  Beta     t        Sig.   95% CI for B
(Constant)            127.203   5.764                22.067   .000   [115.738, 138.668]
LIT_FEMA              -1.129    .079        -.843    -14.302  .000   [-1.286, -.972]

a. Dependent variable: BABYMORT Infant mortality (deaths per 1,000 live births)
37
Regression: infant mortality vs. female literacy, 1995 UN Data
ANOVA(b)

Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   87617.840        1    87617.840     204.538   .000a
Residual     35554.673        83   428.370
Total        123172.513       84

a. Predictors: (Constant), LIT_FEMA Females who read (%)
b. Dependent variable: BABYMORT Infant mortality (deaths per 1,000 live births)
38
Hypothesis test example
Dahua is studying generational differences in educational attainment. He has collected data on 117 father-son pairs: the father's education level is the independent variable and the son's education level is the dependent variable. His estimated regression equation is: y_hat = 0.2915x + 10.25
The standard error of the regression slope is 0.10.
1. At α = 0.05, can Dahua conclude that fathers' and sons' education levels are related?
2. Among all boys whose fathers are college graduates, what is the predicted mean education level of these boys?
3. One boy's father is a college graduate; what do we predict the boy's future education level to be?
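A worked sketch of the answers (assuming, since the problem does not say, that a college degree corresponds to 16 years of schooling):

```python
a, b, se_b, n = 10.25, 0.2915, 0.10, 117

# Q1: t test of H0: beta = 0 with df = n - 2 = 115;
# the two-sided 5% critical value of t with 115 df is about 1.98.
t_stat = b / se_b
print(t_stat)  # 2.915 > 1.98, so reject H0: the education levels are related

# Q2 and Q3: the point prediction for a father with 16 years of education.
y_hat = a + b * 16
print(y_hat)  # approximately 14.91 years
```

Q2 and Q3 share the same point prediction; they differ only in the interval around it (a confidence interval for the mean response vs. a wider prediction interval for an individual).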