An Introduction to GAMs based on penalized regression splines

Simon Wood, Mathematical Sciences, University of Bath, U.K.
Generalized Additive Models (GAM)
- A GAM has a form something like:

  g\{E(y_i)\} = \eta_i = X_i^*\beta^* + f_1(x_{1i}) + f_2(x_{2i}, x_{3i}) + f_3(x_{4i}) + \cdots

- g is a known link function.
- The y_i are independent, with some exponential family distribution. Crucially this ⇒ var(y_i) = V(E(y_i))\phi, where V is a known, distribution dependent function.
- The f_j are smooth unknown functions (subject to centering conditions).
- X^*\beta^* is the parametric bit.
- i.e. a GAM is a GLM where the linear predictor depends on smooth functions of covariates (a minimal fitting sketch follows).
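Purely for illustration, a model of this form might be fitted with the mgcv package; the response, covariate names and Poisson family below are hypothetical, not part of the slide above.

  library(mgcv)
  ## x0 enters parametrically; s(x2, x3) is a 2D smooth of two covariates
  b <- gam(y ~ x0 + s(x1) + s(x2, x3) + s(x4),
           family = poisson(link = "log"), data = dat)
  summary(b)  # effective degrees of freedom and approximate tests per smooth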
GAM Representation and estimation
- Originally GAMs were estimated by backfitting, with any scatterplot smoother used to estimate the f_j . . .
- . . . but it was difficult to estimate the degree of smoothness.
- Now the tendency is to represent the f_j using basis expansions of moderate size, and to apply tuneable quadratic penalties to the model likelihood, to avoid overfit.
- . . . this makes it easier to estimate the degree of smoothness, by estimating the tuning parameters/smoothing parameters.
Example basis-penalty: P-splines

- Eilers and Marx have popularized the use of B-spline bases with discrete penalties.
- If b_k(x) is a B-spline and \beta_k an unknown coefficient, then

  f(x) = \sum_{k=1}^{K} \beta_k b_k(x).

- Wiggliness can be penalized by e.g.

  P = \sum_{k=2}^{K-1} (\beta_{k-1} - 2\beta_k + \beta_{k+1})^2 = \beta^T S \beta

  (a construction sketch follows the figure).
[Figure: left, the B-spline basis functions b_k(x) over x in [0, 1]; right, the resulting fit f(x).]
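A minimal sketch of this basis-penalty construction in R; the grid and basis dimension are illustrative.

  library(splines)
  x <- seq(0, 1, length = 200)
  K <- 10
  B <- bs(x, df = K, intercept = TRUE)   # 200 x K B-spline model matrix
  D <- diff(diag(K), differences = 2)    # (K-2) x K second difference matrix
  S <- t(D) %*% D                        # penalty matrix: beta' S beta equals the
                                         # sum of squared second differences of beta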
Other reduced rank splines
- Reduced rank versions of splines with derivative based penalties often have slightly better MSE performance.
- e.g. choose a set of knots, x_k^*, spread nicely over the range of the covariate, x, and obtain the cubic spline basis based on the x_k^*, i.e. the basis that arises by minimizing, e.g.,

  \sum_k \{y_k^* - f(x_k^*)\}^2 + \lambda \int f''(x)^2 dx \quad \text{w.r.t. } f.

- Choosing the knot locations for any penalized spline type smoother is rather arbitrary. This can be avoided by taking a reduced rank eigen approximation to a full spline (in fact an optimal low rank basis is obtained this way; a one-line mgcv sketch follows).
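In mgcv, thin plate regression splines are based on exactly this sort of eigen approximation; a minimal sketch, with data and basis dimension illustrative:

  library(mgcv)
  b <- gam(y ~ s(x, bs = "tp", k = 20), data = dat)  # rank 20 eigen approximation
                                                     # to a full thin plate spline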
Rank 15 Eigen Approx to 2D TPS
[Figure: perspective plots of the basis functions of the rank 15 eigen approximation, and the resulting fit f.]
Other basis-penalty smoothers

- Many other basis-penalty smoothers are possible.
- Tensor product smooths are basis-penalty smooths of several covariates, constructed (automatically) from smooths of single covariates (a sketch follows this list).
- Tensor product smooths are immune to covariate rescaling, provided that they are multiply penalized.
- Finite area smoothing is also possible (look out for soap film smoothing).
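A minimal sketch of a tensor product smooth in mgcv; the variable names and marginal basis dimensions are illustrative.

  library(mgcv)
  b <- gam(y ~ te(x, z, k = c(8, 8)), data = dat)  # tensor product smooth of x and z,
                                                   # with one penalty per margin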
Estimation
- Whatever the basis, the GAM becomes g\{E(y_i)\} = X_i\beta, a richly parameterized GLM.
- To avoid overfit, estimate \beta to minimize

  D(\beta) + \sum_j \lambda_j \beta^T S_j \beta

  — the penalized deviance. The \lambda_j control the fit-smoothness (variance-bias) tradeoff.
- This objective can be obtained 'directly', or by putting a prior on function wiggliness, \propto \exp(-\sum_j \lambda_j \beta^T S_j \beta / 2).
- So a GAM is also a GLMM, and the \lambda_j are variance parameters.
- Given the \lambda_j, the actual fitting of \beta is by a penalized version of IRLS (Fisher scoring or full Newton), or by MCMC (a sketch of the penalized least squares step follows).
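For a Gaussian model with identity link, the penalized IRLS iteration collapses to a single penalized least squares solve; a naive normal-equations sketch, with X, y, S and lambda assumed given (the stability slides below explain why a QR based solve is preferred in practice):

  pls <- function(X, y, S, lambda) {
    ## beta_hat = (X'X + lambda * S)^{-1} X'y
    solve(t(X) %*% X + lambda * S, t(X) %*% y)
  }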
Smoothness selection
I Various criteria can be minimized for λ selection/estimationI Cross validation leads to a GCV criterion
Vg = D(β̂)/ {n − tr(A)}2
I AIC or Mallows’ Cp leads to Va = D(β̂) + 2tr(A)φ.I Taking the Bayesian/mixed model approach seriously, a
REML based criteria is
Vr = D(β̂)/φ + β̂TSβ/φ + log |XTWX + S| − log |S|+ − 2ls
I . . . W is the diagonal matrix of IRLS weights, S =∑
j λjSj ,A = X(XTWX + S)−1XTW, the trace of which is the modelEDF and ls is the saturated log likelihood.
Numerical Methods for optimizing λ
- All the criteria can reliably be optimized by Newton's method (outer to the PIRLS \beta estimation).
- Derivatives of V w.r.t. the \log(\lambda_j) are needed for this . . .
  1. Get the derivatives of \hat\beta w.r.t. \log(\lambda_j) by differentiating the PIRLS, or by an Implicit Function Theorem approach.
  2. Given these, we can get the derivatives of W, and hence of tr(A), w.r.t. the \log(\lambda_j), as well as the derivatives of D.
- The derivatives of GCV, AIC and REML have very similar ingredients.
- Some care is needed to ensure maximum efficiency and stability.
- MCMC and boosting offer alternatives for 'estimating' \lambda and \beta. (A naive illustration of the outer optimization follows.)
Why worry about stability? A simple example
The x, y data on the left were modelled using the cubic spline on the right (full TPS basis, \lambda chosen by GCV).
[Figure: left, the raw x, y data; right, the fitted smooth, s(x, 15.07).]

The next slide compares the GCV calculation based on the naïve 'normal equations' calculation \hat\beta = (X^T X + \lambda S)^{-1} X^T y with a stable QR based alternative . . .
Stability matters for λ selection!
[Figure: log(GCV) against log10(\lambda), computed three ways: normal equations with Choleski, normal equations with SVD, and the stable QR method.]

. . . automatic minimization of the red or green (normal equations) versions of GCV is not a good idea.
GAM inference
- The best calibrated inference, in a frequentist sense, seems to arise by taking a Bayesian approach.
- Recall the prior on function wiggliness,

  \propto \exp\left(-\frac{1}{2}\sum_j \lambda_j \beta^T S_j \beta\right)

  — an improper Gaussian prior on \beta.
- Bayes' rule and some asymptotics then ⇒

  \beta \,|\, y \sim N\left(\hat\beta, \ (X^T W X + \sum_j \lambda_j S_j)^{-1}\phi\right).

- The posterior ⇒ e.g. CIs for the f_j, but one can also simulate from the posterior very cheaply, to make inferences about anything the GAM predicts (as sketched below).
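A minimal sketch of such posterior simulation for a fitted mgcv model b; the new-data frame newd and the number of draws are illustrative.

  library(MASS)                              # for mvrnorm
  Xp <- predict(b, newd, type = "lpmatrix")  # maps coefficients to linear predictor
  br <- mvrnorm(1000, coef(b), vcov(b))      # draws from the approximate posterior
  eta <- Xp %*% t(br)                        # posterior draws of the linear predictor
  ci <- apply(eta, 1, quantile, c(.025, .975))  # pointwise 95% credible intervals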
GAM inference II
- The Bayesian CIs have good across-the-function frequentist coverage probabilities, provided the smoothing bias is somewhat less than the variance.
- Neglect of smoothing parameter uncertainty is not very important here.
- An extension of Nychka (1988, JASA) is the key to understanding these results.
- P-values for testing model components for equality to zero are also possible, by 'inverting' the Bayesian CI for the component. The properties of these p-values are less good than those of the CIs.
Example: retinopathy data

[Figure: pairwise plots of the retinopathy data: ret (binary response), bmi, gly and dur.]
Retinopathy models?
- Question: how is the development of retinopathy in diabetics related to duration of disease at baseline (dur), body mass index (bmi) and percentage glycosylated haemoglobin (gly)?
- A possible model is

  \text{logit}\{E(\text{ret})\} = f_1(\text{dur}) + f_2(\text{bmi}) + f_3(\text{gly}) + f_4(\text{dur}, \text{bmi}) + f_5(\text{dur}, \text{gly}) + f_6(\text{gly}, \text{bmi})

  where ret ~ Bernoulli.
- In R, the model is fitted with something like

  gam(ret ~ te(dur) + te(gly) + te(bmi) + te(dur,gly) + te(dur,bmi) + te(gly,bmi),
      family = binomial)
Retinopathy Estimated effects
[Figure: estimated effects s(dur, 3.26), s(gly, 1), s(bmi, 2.67), te(dur,gly, 0), te(dur,bmi, 0) and te(gly,bmi, 2.5); the number after each term is its effective degrees of freedom.]
Retinopathy GLY-BMI interaction
[Figure: perspective plots of the gly-bmi interaction on the linear predictor scale, over bmi roughly 15-50 and gly roughly 10-20; red/green surfaces are at +/- s.e.]
GAM ‘extensions’
- Obtaining a satisfactory framework for generalized additive modelling has required solving a rather more general estimation problem . . .
- The GAM framework can cope with any quadratically penalized GLM where the smoothing parameters enter the objective linearly. Consequently the following example extensions can all be used without new theory . . .
- Varying coefficient models, where a coefficient in a GLM is allowed to vary smoothly with another covariate.
- Model terms involving any linear functional of a smooth function, for example functional GLMs.
- Simple random effects, since a random effect can be treated as a smooth (see the sketch below).
- Adaptive smooths, which can be constructed by using multiple penalties for a smooth.
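A minimal sketch of a simple random effect treated as a smooth in mgcv; the grouping factor g and data frame dat are illustrative.

  library(mgcv)
  b <- gam(y ~ s(x) + s(g, bs = "re"),  # g must be a factor; bs = "re" gives an
           data = dat, method = "REML") # i.i.d. Gaussian random intercept per level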
Example: functional covariates

- Consider data on 150 functions, x_i(t) (each observed at t^T = (t_1, . . . , t_200)), with corresponding noisy univariate responses, y_i.
- The first 9 (x_i(t), y_i) pairs are . . .
[Figure: the first 9 predictor functions, x_i(t), plotted against t, each panel labelled with its response y_i.]
F-GLM
- An appropriate model might be the functional GLM

  g\{E(y_i)\} = \int f(t) x_i(t) dt

  where the predictor x_i is a known function and f(t) is an unknown smooth regression coefficient function.
- Typically f and x_i are discretized, so that g\{E(y_i)\} = f^T x_i, where f^T = [f(t_1), f(t_2), . . .] and x_i^T = [x_i(t_1), x_i(t_2), . . .].
- Generically, this is an example of dependence on a linear functional of a smooth.
- The R package mgcv has a simple 'summation convention' mechanism to handle such terms . . .
F-GLM fitting

- We want to estimate the smooth, f, in the model y_i = \int f(t) x_i(t) dt + \epsilon_i.
- gam(y ~ s(T, by = X)) will do this, if T and X are matrices.
- The ith row of X is the observed (discretized) function x_i(t). Each row of T is a replicate of the observation time vector t. (A setup sketch follows the figure.)
[Figure: the estimated coefficient function, s(T, 6.54):X, plotted against t.]
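A minimal sketch of setting up these matrices; Xfun, an assumed n x 200 matrix whose ith row holds the discretized x_i(t), and the time grid are illustrative.

  library(mgcv)
  tgrid <- seq(0, 1, length = 200)
  Tmat <- matrix(tgrid, nrow(Xfun), 200, byrow = TRUE)  # every row is a copy of t
  b <- gam(y ~ s(Tmat, by = Xfun))  # summation convention: the smooth of t is
                                    # summed against each row of Xfun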
Adaptive smoothing

- Perhaps I don't like this P-spline smooth of 'the' motorcycle crash data . . .
[Figure: the fit from gam(accel ~ s(times, bs = "ps", k = 40)), i.e. s(times, 10.29), to the motorcycle data.]
- Should I really use adaptive smoothing?
Adaptive smoothing 2
- P-splines and the preceding GAM framework make it very easy to do adaptive smoothing.
- Use a B-spline basis, f(x) = \sum_j \beta_j b_j(x), with an adaptive penalty

  P = \sum_{k=2}^{K-1} c_k (\beta_{k-1} - 2\beta_k + \beta_{k+1})^2,

  where c_k varies smoothly with k, and hence with x.
- Defining d_k = \beta_{k-1} - 2\beta_k + \beta_{k+1}, and D to be the matrix such that d = D\beta, we have P = \beta^T D^T \text{diag}(c) D \beta.
- Now use a (B-spline) basis expansion for c, so that c = C\lambda.
- Then P = \sum_j \lambda_j \beta^T D^T \text{diag}(C_{\cdot j}) D \beta.
- i.e. the adaptive P-spline is just a P-spline with multiple penalties (see the sketch below).
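A minimal sketch of constructing the multiple penalty matrices S_j = D^T diag(C_{.j}) D; the basis dimensions are illustrative.

  library(splines)
  K <- 40; J <- 5                               # spline coefficients; penalties
  D <- diff(diag(K), differences = 2)           # (K-2) x K second difference matrix
  C <- bs(1:(K - 2), df = J, intercept = TRUE)  # B-spline basis for c over k
  Slist <- lapply(1:J, function(j) t(D) %*% diag(C[, j]) %*% D)  # the S_j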
Adaptive smoothing 3
- The R package mgcv has an adaptive P-spline class. Using it does give some improvement . . .
[Figure: the fit from gam(accel ~ s(times, bs = "ad", k = 40)), i.e. s(times, 8.56), to the motorcycle data.]
Conclusions
- Penalized regression splines are the starting point for a fairly complete framework for generalized additive modelling.
- The numerical methods and theory developed for this framework are applicable to any quadratically penalized GLM, so many extensions of 'standard' GAMs are possible.
- The R package mgcv tries to exploit the generality of the framework, so that almost any quadratically penalized GLM can readily be used.
References
- Hastie and Tibshirani (1986) invented GAMs. The work of Wahba (e.g. 1990) and Gu (e.g. 2002) heavily influenced the work presented here. Duchon (1977) invented thin plate splines. The retinopathy data are from Gu.
- Penalized regression splines go back to Wahba (1980), but were given real impetus by Eilers and Marx (1996), and in a GAM context by Marx and Eilers (1998).
- See Wood (2006), Generalized Additive Models: An Introduction with R, CRC, for more information. Wood (2008, JRSSB) is more up to date on numerical methods.
- The mgcv package in R implements everything covered here.