Outline
1. Motivations
2. Smoothing
3. Degrees of Freedom
Motivation: Why nonparametric?
Simple linear regression: E(Y|X) = α + βX
• assumes that the mean of Y is a linear function of X
• (+) easy computation, description, interpretation, etc.
• (−) limited range of uses
Note that the hat matrix in the LSE of regression, Ŷ = X(X'X)⁻¹X'Y ≡ SY, is
I. symmetric and idempotent
II. constant preserving, i.e. S1 = 1
III. tr(S S') = tr(S) = rank(S) = # of linearly independent predictors in a model = # of parameters in a model
(a numerical check follows below)
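As a quick check, here is a minimal numpy sketch (not from the original slides) verifying properties I–III for a small design matrix:

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # p columns incl. intercept
S = X @ np.linalg.inv(X.T @ X) @ X.T                            # hat matrix S

print(np.allclose(S, S.T))                        # I.  symmetric
print(np.allclose(S @ S, S))                      # I.  idempotent
print(np.allclose(S @ np.ones(n), np.ones(n)))    # II. constant preserving: S1 = 1
print(np.trace(S @ S.T), np.trace(S),
      np.linalg.matrix_rank(S))                   # III. all equal p = 3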
If the dependence of E(Y) on X is far from linear,
• one can extend straight-line regression by adding terms like X² to the model,
• but it is difficult to guess the most appropriate functional form just from looking at the data.
Example: Diabetes data
1. Diabetes data (Sockett et al., 1987): a study of the factors affecting patterns of insulin-dependent diabetes mellitus in children.
• Response: logarithm of C-peptide concentration at diagnosis
• Predictors: age and base deficit.
What is a smoother?
A tool for summarizing the trend of a response Y as a function of one or more predictor measurements X1,X2,…,Xp.
Idea of smoothers
The simplest smoothers occur in the case of a categorical predictor,
Example: sex (male, female)
Example: color (red, blue, green)
To smooth Y, simply average the values of Y in each category.
How about a non-categorical predictor?
• usually lacks replicates at each predictor value
• mimic category averaging through “local averaging”, i.e. average the Y values in neighborhoods around each target value
Two main uses of smoothers
I. Description: to enhance the visual appearance of the scatterplot of Y vs. X.
II. Estimation: to estimate the dependence of the mean of Y on the predictors
Two main decisions to be made in scatterplot smoothing
1. How to average the response values in each neighborhood? (which brand of smoother?)
2. How big to take the neighborhoods? (smoothing parameter = ?)
Scatterplot Smoothing
Notations:
• y = (y_1, y_2, …, y_n)'
• x = (x_1, x_2, …, x_n)' with x_1 < x_2 < … < x_n
• Def: s(x_0) = S(y | x = x_0)
Some scatterplot smoothers:
1. Bin smoothers
• Choose cut points c_0 < c_1 < … < c_K
• Def: R_k = {i : c_{k−1} ≤ x_i < c_k}, the indices of the data points in each region; smooth by averaging the y values within each region (sketch below).
• (−): estimate is not smooth (jumps at each cut point).
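A minimal Python sketch of a bin smoother, assuming numpy arrays as inputs:

import numpy as np

def bin_smoother(x, y, cutpoints):
    """Average y within each region defined by the cut points; returns the fit at each x."""
    x, y = np.asarray(x), np.asarray(y, dtype=float)
    region = np.digitize(x, cutpoints)       # index of the region containing each x_i
    fit = np.empty_like(y)
    for k in np.unique(region):
        in_k = region == k
        fit[in_k] = y[in_k].mean()           # category-style averaging within R_k
    return fit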
2. Running-mean smoothers (moving average)
• Choose a symmetric nearest neighborhood N^S(x_i)
• Define the running mean s(x_i) = ave_{j ∈ N^S(x_i)}(y_j) (sketch below)
• (+): simple
• (−): doesn't work well (wiggly), severely biased near the endpoints
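A sketch of the running mean; the window is simply truncated at the ends, which is where the bias noted above is worst:

import numpy as np

def running_mean(y, k):
    """Symmetric running mean: average y over the window of k neighbors on each side."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # windows are truncated near the endpoints, hence the endpoint bias
    return np.array([y[max(0, i - k):min(n, i + k + 1)].mean() for i in range(n)])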
3. Running-line smoothers
• Def: s(x_0) = α̂(x_0) + β̂(x_0) x_0, where α̂(x_0) and β̂(x_0) are the LSEs for the data points in N^S(x_0)
• (−): jagged => weighted LSE
4. Kernel smoothers
Def: s(x_0) = Σ_j S_{0j} y_j with weights S_{0j} = c_0 d(|x_0 − x_j| / λ),
where d(t) is a smooth even function decreasing in |t|, λ = bandwidth, and c_0 is chosen so that the weights sum to 1.
Example. Gaussian kernel: d(t) = exp(−t²/2)
Example. Epanechnikov kernel: d(t) = (3/4)(1 − t²) for |t| ≤ 1; 0 otherwise
Example. Minimum variance kernel: d(t) = (3/8)(3 − 5t²) for |t| ≤ 1; 0 otherwise
(sketch below)
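A sketch of a kernel smoother with the three example kernels; function and variable names are illustrative:

import numpy as np

def gaussian(t):      return np.exp(-t**2 / 2)
def epanechnikov(t):  return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)
def min_variance(t):  return np.where(np.abs(t) <= 1, (3/8) * (3 - 5*t**2), 0.0)

def kernel_smoother(x, y, x0, lam, d=gaussian):
    """s(x0) = sum_j w_j y_j with w_j proportional to d(|x0 - x_j| / lambda)."""
    w = d(np.abs(x0 - np.asarray(x)) / lam)
    return np.sum(w * y) / np.sum(w)     # division by sum(w) = the c_0 normalization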
5. Running-median smoothers
Def: s(x_i) = med_{j ∈ N^S(x_i)}(y_j)
• makes the smoother resistant to outliers in the data
• a nonlinear smoother
6. Regression splines
• The regions are separated by a sequence of knots {ξ_1, …, ξ_K}.
• Piecewise polynomial, e.g. piecewise cubic polynomial, joined smoothly at these knots.
• P.S. more knots => more flexible.
6a. Piecewise-cubic spline
(1) s is a cubic polynomial in any subinterval [ξ_j, ξ_{j+1})
(2) s has two continuous derivatives
(3) s has a third derivative that is a step function with jumps at the knots
Its parametric expression:
s(x) = β_0 + β_1 x + β_2 x² + β_3 x³ + Σ_{j=1}^{K} θ_j (x − ξ_j)_+³,
where a_+ denotes the positive part of a.
• It can be rewritten as a linear combination of the K+4 basis functions P_1(x) = 1, P_2(x) = x, …, P_{K+4}(x) = (x − ξ_K)_+³ (sketch below).
• de Boor (1978): B-spline basis functions.
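A sketch of the truncated power basis; a regression spline is then just least squares on this basis (B-splines are the numerically better-behaved basis in practice):

import numpy as np

def truncated_power_basis(x, knots):
    """K+4 basis functions: 1, x, x^2, x^3, and (x - xi_j)_+^3 for each knot xi_j."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.maximum(x - xi, 0.0) ** 3 for xi in knots]   # positive part, cubed
    return np.column_stack(cols)

# A regression spline is ordinary least squares on this basis:
# B = truncated_power_basis(x, knots)
# coef, *_ = np.linalg.lstsq(B, y, rcond=None)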
6b. Natural spline
Def: regression spline (see 6a)
+ the constraint that f is linear in the boundary regions, i.e. f'' = f''' = 0 beyond the boundary knots.
7. Cubic smoothing splines
Find f that minimizes the penalized residual sum of squares
Σ_{i=1}^{n} {y_i − f(x_i)}² + λ ∫_a^b {f''(t)}² dt
• first term: closeness to the data
• second term: penalizes curvature in the function
• λ: (1) large values produce a smoother curve
     (2) small values produce a wiggly curve
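Exact smoothing-spline fitting uses a natural-cubic-spline basis; as a sketch of the same idea, here is a discrete analogue (a Whittaker-type smoother) that replaces the integral of {f''}² by squared second differences, assuming equally spaced x:

import numpy as np

def discrete_smoothing_spline(y, lam):
    """Minimizes ||y - f||^2 + lam * ||D2 f||^2, a discrete analogue of the
    penalized RSS above (equally spaced x assumed)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D2 = np.diff(np.eye(n), n=2, axis=0)          # (n-2) x n second-difference matrix
    # closed-form minimizer: f_hat = (I + lam * D2'D2)^{-1} y, a linear smoother
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)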
8. Locally-weighted running-line smoothers (loess)
Cleveland (1979)
• define N(x_0) = the k nearest neighbors of x_0
• use the tri-cube weight function, W(u) = (1 − u³)³ for 0 ≤ u < 1, in the weighted LSE (sketch below)
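A sketch of a single loess fit at one target point x_0, with tri-cube weights and a locally weighted line:

import numpy as np

def loess_fit(x, y, x0, k):
    """Locally weighted running line at x0 (Cleveland, 1979) with tri-cube weights."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[:k]                    # k nearest neighbors of x0
    u = dist[idx] / dist[idx].max()               # scale distances by the furthest neighbor
    w = (1 - u**3) ** 3                           # tri-cube weight function
    X = np.column_stack([np.ones(k), x[idx]])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y[idx])   # weighted LSE
    return beta[0] + beta[1] * x0                 # fitted local line, evaluated at x0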
Smoothers for multiple predictors
1. Multiple-predictor smoothers, e.g. a p-dimensional kernel (see figure)
   (−): difficulty of interpretation and computation
2. Additive model
3. Semi-parametric model
“Curse of dimensionality”
Neighborhoods with a fixed number of points become less local as the dimension increases (Bellman, 1961).
• For p = 1, a neighborhood containing 10% of the data (span = 0.1) has side length 0.1.
• For p = 10, the side length needs to be 0.1^{1/10} ≈ 0.8 to capture the same fraction of the data: hardly “local”.
Additive model
Additive: Y = f_1(X_1) + … + f_p(X_p) + ε
• Selection and estimation are usually based on smoothing, backfitting, BRUTO, ACE, Projector, etc. (Hastie, 1990).
• Backfitting (see HT 90; a sketch follows below).
• The BRUTO algorithm (see HT 90) is a forward model-selection procedure that uses a modified GCV, defined later, to choose the significant variables and their smoothing parameters.
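A sketch of the backfitting loop for the additive model above; `smoother` stands for any scatterplot smoother from Section 2 (its name and signature are illustrative):

import numpy as np

def backfit(X, y, smoother, n_iter=20):
    """Backfitting for Y = alpha + sum_j f_j(X_j) + error (see HT 90).
    smoother(x, r) returns the smooth of r against x, evaluated at each x."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))
    for _ in range(n_iter):                           # cycle until (approximate) convergence
        for j in range(p):
            r = y - alpha - f.sum(axis=1) + f[:, j]   # partial residual for X_j
            f[:, j] = smoother(X[:, j], r)
            f[:, j] -= f[:, j].mean()                 # center each f_j for identifiability
    return alpha, f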
(Smoothing in detail)
Assume Y = f(X) + ε,
where E(ε) = 0, V(ε) = σ², and X is independent of ε.
The bias-variance trade-off
Example. running mean with k neighbors on each side of x_i:
f̂_k(x_i) = (1/(2k+1)) Σ_{j ∈ N^S(x_i)} y_j
E{f̂_k(x_i)} = (1/(2k+1)) Σ_{j ∈ N^S(x_i)} f(x_j)
V{f̂_k(x_i)} = σ²/(2k+1)
Expanding f in a Taylor series,
f(x_j) = f(x_i) + (x_j − x_i) f'(x_i) + ((x_j − x_i)²/2) f''(x_i) + R,
assuming the data are equally spaced with Δ = x_{i+1} − x_i, and ignoring R:
Bias{f̂_k(x_i)} = E{f̂_k(x_i)} − f(x_i) = (k(k+1)/6) Δ² f''(x_i),
and the optimal k is chosen by minimizing MSE(x_i) = E{f̂_k(x_i) − f(x_i)}²:
k_opt = {9σ² / (2 Δ⁴ f''(x_i)²)}^{1/5}
Automatic selection of smoothing parameters
(1) Average mean-squared error:
MSE(λ) = (1/n) Σ_{i=1}^{n} E{f̂_λ(x_i) − f(x_i)}²
(2) Average predictive squared error:
PSE(λ) = (1/n) Σ_{i=1}^{n} E{Y_i* − f̂_λ(x_i)}²,
where Y_i* is a new observation at x_i.
Some estimates of PSE:
1. Cross-validation (CV)
CV(λ) = (1/n) Σ_{i=1}^{n} {y_i − f̂_λ^{(−i)}(x_i)}²,
where f̂^{(−i)}(x_i) indicates the fit at x_i, computed by leaving out the ith data point.
Fact: E{CV(λ)} ≈ PSE(λ)
Since
E{y_i − f̂^{(−i)}(x_i)}² = E{y_i − f(x_i) + f(x_i) − f̂^{(−i)}(x_i)}²
= σ² + E{f(x_i) − f̂^{(−i)}(x_i)}² ≈ σ² + E{f(x_i) − f̂(x_i)}²
2. Average squared residual (ASR)
ASR(λ) = (1/n) Σ_{i=1}^{n} {y_i − f̂_λ(x_i)}²
• ASR is not a good estimate of PSE.
Linear smoothers
• Def 1: S(a y_1 + b y_2 | x) = a S(y_1 | x) + b S(y_2 | x)
• Def 2: f̂ = S y, where S = [S_ij] is called the smoother matrix (free of y)
• e.g. running-mean, running-line, smoothing spline, kernel, loess and regression spline
The bias-variance trade-off for linear smoothers
With bias vector b = f − E(S y) = (I − S) f, so that b_i = f(x_i) − E{f̂(x_i)}:
MSE(λ) = (1/n) Σ_{i=1}^{n} var{f̂(x_i)} + (1/n) Σ_{i=1}^{n} b_i² = σ² tr(S S')/n + b'b/n
PSE(λ) = σ² {1 + tr(S S')/n} + b'b/n
Cross-validation (CV) for linear smoothers
Constant preserving: leave out point i and renormalize the weights, S_ij => S_ij / (1 − S_ii)
=> f̂^{(−i)}(x_i) = Σ_{j≠i} S_ij y_j / (1 − S_ii)
=> y_i − f̂^{(−i)}(x_i) = {y_i − f̂(x_i)} / (1 − S_ii)
=> CV(λ) = (1/n) Σ_{i=1}^{n} [{y_i − f̂_λ(x_i)} / (1 − S_ii)]²
(a numerical check follows below)
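A sketch checking the shortcut numerically: the renormalized leave-one-out fit and the 1/(1 − S_ii) formula agree by the identity above:

import numpy as np

def cv_shortcut(y, S):
    """CV(lambda) from the full fit only: no leave-one-out refits needed."""
    resid = y - S @ y
    return np.mean((resid / (1.0 - np.diag(S))) ** 2)

def cv_leave_one_out(y, S):
    """Direct leave-one-out fits with renormalized weights S_ij/(1 - S_ii);
    agrees exactly with cv_shortcut by the identity derived above."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        w = S[i].copy()
        w[i] = 0.0                               # drop point i
        f_minus_i = (w @ y) / (1.0 - S[i, i])    # renormalized fit without point i
        errs[i] = (y[i] - f_minus_i) ** 2
    return errs.mean()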
Generalized cross-validation (GCV)
GCV(λ) = (1/n) Σ_{i=1}^{n} [{y_i − f̂_λ(x_i)} / {1 − tr(S)/n}]²
Degrees of freedom of a smoother
Why do we need df?
The same data set, together with the computational power of modern computers, is routinely used in the formulation, selection, estimation, diagnosis and prediction of a statistical model.
Three definitions (computed below):
1. df = tr(S)
2. df_err = n − tr(2S − S S')
3. df_var = tr(S S')
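A sketch computing the three df definitions from a smoother matrix S:

import numpy as np

def smoother_dfs(S):
    """The three degrees-of-freedom definitions for a linear smoother matrix S.
    For a least-squares hat matrix all three equal the number of parameters."""
    n = S.shape[0]
    df     = np.trace(S)                      # 1. df = tr(S)
    df_err = n - np.trace(2 * S - S @ S.T)    # 2. df_err = n - tr(2S - SS')
    df_var = np.trace(S @ S.T)                # 3. df_var = tr(SS')
    return df, df_err, df_var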
EDF (Ye, 1998)
Idea: a modeling / forecasting procedure f̂(Y) = (f_1(Y), …, f_n(Y))' : R^n → R^n is said to be stable if small changes in Y produce small changes in the fitted values f̂(Y).
More precisely (EDF), for δ = (δ_1, …, δ_n)' ∈ R^n we would like to have
f̂(Y + δ) − f̂(Y) ≈ H δ,
where H = [h_ij] = [∂f̂_i/∂Y_j], i, j = 1, …, n, is a small matrix.
=> f_i(Y + δ) ≈ f_i(Y) + h_ii δ_i, so h_ii can be viewed as the slope of the straight line relating the perturbed fit f_i(Y + δ) to the perturbation δ_i, and EDF = Σ_i h_ii.
Data Perturbation Procedure
For an integer m > 1 (the Monte Carlo sample size), generate δ_1, …, δ_m as i.i.d. N(0, t²I_n), where t > 0 and I_n is the n×n identity matrix.
• Use the “perturbed” data Y + δ_j to refit the model, j = 1, …, m.
• For i = 1, 2, …, n, the slope of the LS line fitted to (f_i(Y + δ_j), δ_ij), j = 1, …, m, gives an estimate of h_ii (a sketch follows below).
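A sketch of the procedure, assuming `fit(y)` reruns the whole modeling procedure on the data and returns the fitted values (the name and signature are illustrative):

import numpy as np

def edf_perturbation(y, fit, t=0.1, m=200, seed=0):
    """Monte Carlo estimate of the h_ii's and of EDF = sum_i h_ii (Ye, 1998)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    delta = rng.normal(0.0, t, size=(m, n))        # delta_1, ..., delta_m ~ N(0, t^2 I_n)
    fits = np.array([fit(y + d) for d in delta])   # refit on each perturbed data set
    h = np.empty(n)
    for i in range(n):
        d, fi = delta[:, i], fits[:, i]
        # slope of the LS line of f_i(Y + delta_j) on delta_ij estimates h_ii
        h[i] = np.sum((d - d.mean()) * (fi - fi.mean())) / np.sum((d - d.mean()) ** 2)
    return h.sum()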
An application
Table 1. MSE & SD of five models fitted to the lynx data

Model     AR(2)    SETAR    ADD(1,2)   ADD(1,2,9)   PPR
MSE       0.0459   0.0358   0.0455     0.038        0.0194
MSEadj    0.0443   0.0365   0.0377
SD        0.295    0.136    0.100      0.347        0.247

About SD: fit the same class of models to the first 100 observations, keeping the last 100 for out-of-sample predictions. SD = the standard deviation of the multi-step-ahead prediction errors.