An Introduction to GAMs based on penalized regression splines

Simon Wood, Mathematical Sciences, University of Bath, U.K.
Generalized Additive Models (GAM)
- A GAM has a form something like:

  g\{E(y_i)\} = \eta_i = X_i^*\beta^* + f_1(x_{1i}) + f_2(x_{2i}, x_{3i}) + f_3(x_{4i}) + \cdots

- g is a known link function.
- The y_i are independent, with some exponential family distribution. Crucially this ⇒ var(y_i) = V(E(y_i))\phi, where V is a known, distribution dependent function.
- The f_j are smooth unknown functions (subject to centering conditions).
- X^*\beta^* is the parametric bit.
- i.e. a GAM is a GLM where the linear predictor depends on smooth functions of covariates (a minimal fitting sketch follows).
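Purely for illustration, a model of this form might be fitted with the mgcv package; the response, covariate names and Poisson family below are hypothetical, not part of the slide above.

  library(mgcv)
  ## x0 enters parametrically; s(x2, x3) is a 2D smooth of two covariates
  b <- gam(y ~ x0 + s(x1) + s(x2, x3) + s(x4),
           family = poisson(link = "log"), data = dat)
  summary(b)  # effective degrees of freedom and approximate tests per smooth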
GAM Representation and estimation
- Originally GAMs were estimated by backfitting, with any scatterplot smoother used to estimate the f_j . . .
- . . . but it was difficult to estimate the degree of smoothness.
- Now the tendency is to represent the f_j using basis expansions of moderate size, and to apply tuneable quadratic penalties to the model likelihood, to avoid overfit.
- . . . this makes it easier to estimate the degree of smoothness, by estimating the tuning parameters/smoothing parameters.
Example basis-penalty: P-splines

- Eilers and Marx have popularized the use of B-spline bases with discrete penalties.
- If b_k(x) is a B-spline and \beta_k an unknown coefficient, then

  f(x) = \sum_{k=1}^{K} \beta_k b_k(x).

- Wiggliness can be penalized by e.g.

  P = \sum_{k=2}^{K-1} (\beta_{k-1} - 2\beta_k + \beta_{k+1})^2 = \beta^T S \beta

  (a construction sketch follows the figure).
[Figure: left, the B-spline basis functions b_k(x) over x in [0, 1]; right, the resulting fit f(x).]
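A minimal sketch of this basis-penalty construction in R; the grid and basis dimension are illustrative.

  library(splines)
  x <- seq(0, 1, length = 200)
  K <- 10
  B <- bs(x, df = K, intercept = TRUE)   # 200 x K B-spline model matrix
  D <- diff(diag(K), differences = 2)    # (K-2) x K second difference matrix
  S <- t(D) %*% D                        # penalty matrix: beta' S beta equals the
                                         # sum of squared second differences of beta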
Other reduced rank splines
- Reduced rank versions of splines with derivative based penalties often have slightly better MSE performance.
- e.g. choose a set of knots, x_k^*, spread nicely over the range of the covariate, x, and obtain the cubic spline basis based on the x_k^*, i.e. the basis that arises by minimizing, e.g.,

  \sum_k \{y_k^* - f(x_k^*)\}^2 + \lambda \int f''(x)^2 dx \quad \text{w.r.t. } f.

- Choosing the knot locations for any penalized spline type smoother is rather arbitrary. This can be avoided by taking a reduced rank eigen approximation to a full spline (in fact an optimal low rank basis is obtained this way; a one-line mgcv sketch follows).
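In mgcv, thin plate regression splines are based on exactly this sort of eigen approximation; a minimal sketch, with data and basis dimension illustrative:

  library(mgcv)
  b <- gam(y ~ s(x, bs = "tp", k = 20), data = dat)  # rank 20 eigen approximation
                                                     # to a full thin plate spline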
Rank 15 Eigen Approx to 2D TPS
[Figure: perspective plots of the basis functions of the rank 15 eigen approximation, and the resulting fit f.]
Other basis-penalty smoothers

- Many other basis-penalty smoothers are possible.
- Tensor product smooths are basis-penalty smooths of several covariates, constructed (automatically) from smooths of single covariates (a sketch follows this list).
- Tensor product smooths are immune to covariate rescaling, provided that they are multiply penalized.
- Finite area smoothing is also possible (look out for soap film smoothing).
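A minimal sketch of a tensor product smooth in mgcv; the variable names and marginal basis dimensions are illustrative.

  library(mgcv)
  b <- gam(y ~ te(x, z, k = c(8, 8)), data = dat)  # tensor product smooth of x and z,
                                                   # with one penalty per margin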
Estimation
- Whatever the basis, the GAM becomes g\{E(y_i)\} = X_i\beta, a richly parameterized GLM.
- To avoid overfit, estimate \beta to minimize

  D(\beta) + \sum_j \lambda_j \beta^T S_j \beta

  — the penalized deviance. The \lambda_j control the fit-smoothness (variance-bias) tradeoff.
- This objective can be obtained 'directly', or by putting a prior on function wiggliness, \propto \exp(-\sum_j \lambda_j \beta^T S_j \beta / 2).
- So a GAM is also a GLMM, and the \lambda_j are variance parameters.
- Given the \lambda_j, the actual fitting of \beta is by a penalized version of IRLS (Fisher scoring or full Newton), or by MCMC (a sketch of the penalized least squares step follows).
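For a Gaussian model with identity link, the penalized IRLS iteration collapses to a single penalized least squares solve; a naive normal-equations sketch, with X, y, S and lambda assumed given (the stability slides below explain why a QR based solve is preferred in practice):

  pls <- function(X, y, S, lambda) {
    ## beta_hat = (X'X + lambda * S)^{-1} X'y
    solve(t(X) %*% X + lambda * S, t(X) %*% y)
  }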
Smoothness selection
I Various criteria can be minimized for λ selection/estimationI Cross validation leads to a GCV criterion
Vg = D(β̂)/ {n − tr(A)}2
I AIC or Mallows’ Cp leads to Va = D(β̂) + 2tr(A)φ.I Taking the Bayesian/mixed model approach seriously, a
REML based criteria is
Vr = D(β̂)/φ + β̂TSβ/φ + log |XTWX + S| − log |S|+ − 2ls
I . . . W is the diagonal matrix of IRLS weights, S =∑
j λjSj ,A = X(XTWX + S)−1XTW, the trace of which is the modelEDF and ls is the saturated log likelihood.
Numerical Methods for optimizing λ
- All the criteria can reliably be optimized by Newton's method (outer to the PIRLS \beta estimation).
- Derivatives of V w.r.t. the \log(\lambda_j) are needed for this . . .
  1. Get the derivatives of \hat\beta w.r.t. \log(\lambda_j) by differentiating the PIRLS, or by an Implicit Function Theorem approach.
  2. Given these, we can get the derivatives of W, and hence of tr(A), w.r.t. the \log(\lambda_j), as well as the derivatives of D.
- The derivatives of GCV, AIC and REML have very similar ingredients.
- Some care is needed to ensure maximum efficiency and stability.
- MCMC and boosting offer alternatives for 'estimating' \lambda and \beta. (A naive illustration of the outer optimization follows.)
Why worry about stability? A simple example
The x, y data on the left were modelled using the cubic spline on the right (full TPS basis, \lambda chosen by GCV).
[Figure: left, the raw x, y data; right, the fitted smooth, s(x, 15.07).]

The next slide compares the GCV calculation based on the naïve 'normal equations' calculation \hat\beta = (X^T X + \lambda S)^{-1} X^T y with a stable QR based alternative . . .
Stability matters for λ selection!
[Figure: log(GCV) against log10(\lambda), computed three ways: normal equations with Choleski, normal equations with SVD, and the stable QR method.]

. . . automatic minimization of the red or green (normal equations) versions of GCV is not a good idea.
GAM inference
- The best calibrated inference, in a frequentist sense, seems to arise by taking a Bayesian approach.
- Recall the prior on function wiggliness,

  \propto \exp\left(-\frac{1}{2}\sum_j \lambda_j \beta^T S_j \beta\right)

  — an improper Gaussian prior on \beta.
- Bayes' rule and some asymptotics then ⇒

  \beta \,|\, y \sim N\left(\hat\beta, \ (X^T W X + \sum_j \lambda_j S_j)^{-1}\phi\right).

- The posterior ⇒ e.g. CIs for the f_j, but one can also simulate from the posterior very cheaply, to make inferences about anything the GAM predicts (as sketched below).
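A minimal sketch of such posterior simulation for a fitted mgcv model b; the new-data frame newd and the number of draws are illustrative.

  library(MASS)                              # for mvrnorm
  Xp <- predict(b, newd, type = "lpmatrix")  # maps coefficients to linear predictor
  br <- mvrnorm(1000, coef(b), vcov(b))      # draws from the approximate posterior
  eta <- Xp %*% t(br)                        # posterior draws of the linear predictor
  ci <- apply(eta, 1, quantile, c(.025, .975))  # pointwise 95% credible intervals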
GAM inference II
- The Bayesian CIs have good across-the-function frequentist coverage probabilities, provided the smoothing bias is somewhat less than the variance.
- Neglect of smoothing parameter uncertainty is not very important here.
- An extension of Nychka (1988, JASA) is the key to understanding these results.
- P-values for testing model components for equality to zero are also possible, by 'inverting' the Bayesian CI for the component. The properties of these p-values are less good than those of the CIs.
Example: retinopathy data

[Figure: pairwise plots of the retinopathy data: ret (binary response), bmi, gly and dur.]
Retinopathy models?
- Question: how is the development of retinopathy in diabetics related to duration of disease at baseline (dur), body mass index (bmi) and percentage glycosylated haemoglobin (gly)?
- A possible model is

  \text{logit}\{E(\text{ret})\} = f_1(\text{dur}) + f_2(\text{bmi}) + f_3(\text{gly}) + f_4(\text{dur}, \text{bmi}) + f_5(\text{dur}, \text{gly}) + f_6(\text{gly}, \text{bmi})

  where ret ~ Bernoulli.
- In R, the model is fitted with something like

  gam(ret ~ te(dur) + te(gly) + te(bmi) + te(dur,gly) + te(dur,bmi) + te(gly,bmi),
      family = binomial)
Retinopathy Estimated effects
[Figure: estimated effects s(dur, 3.26), s(gly, 1), s(bmi, 2.67), te(dur,gly, 0), te(dur,bmi, 0) and te(gly,bmi, 2.5); the number after each term is its effective degrees of freedom.]
Retinopathy GLY-BMI interaction
[Figure: perspective plots of the gly-bmi interaction on the linear predictor scale, over bmi roughly 15-50 and gly roughly 10-20; red/green surfaces are at +/- s.e.]
GAM ‘extensions’
- Obtaining a satisfactory framework for generalized additive modelling has required solving a rather more general estimation problem . . .
- The GAM framework can cope with any quadratically penalized GLM where the smoothing parameters enter the objective linearly. Consequently the following example extensions can all be used without new theory . . .
- Varying coefficient models, where a coefficient in a GLM is allowed to vary smoothly with another covariate.
- Model terms involving any linear functional of a smooth function, for example functional GLMs.
- Simple random effects, since a random effect can be treated as a smooth (see the sketch below).
- Adaptive smooths, which can be constructed by using multiple penalties for a smooth.
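A minimal sketch of a simple random effect treated as a smooth in mgcv; the grouping factor g and data frame dat are illustrative.

  library(mgcv)
  b <- gam(y ~ s(x) + s(g, bs = "re"),  # g must be a factor; bs = "re" gives an
           data = dat, method = "REML") # i.i.d. Gaussian random intercept per level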
Example: functional covariates

- Consider data on 150 functions, x_i(t) (each observed at t^T = (t_1, . . . , t_200)), with corresponding noisy univariate responses, y_i.
- The first 9 (x_i(t), y_i) pairs are . . .
[Figure: the first 9 predictor functions, x_i(t), plotted against t, each panel labelled with its response y_i.]
F-GLM
- An appropriate model might be the functional GLM

  g\{E(y_i)\} = \int f(t) x_i(t) dt

  where the predictor x_i is a known function and f(t) is an unknown smooth regression coefficient function.
- Typically f and x_i are discretized, so that g\{E(y_i)\} = f^T x_i, where f^T = [f(t_1), f(t_2), . . .] and x_i^T = [x_i(t_1), x_i(t_2), . . .].
- Generically, this is an example of dependence on a linear functional of a smooth.
- The R package mgcv has a simple 'summation convention' mechanism to handle such terms . . .
F-GLM fitting

- We want to estimate the smooth, f, in the model y_i = \int f(t) x_i(t) dt + \epsilon_i.
- gam(y ~ s(T, by = X)) will do this, if T and X are matrices.
- The ith row of X is the observed (discretized) function x_i(t). Each row of T is a replicate of the observation time vector t. (A setup sketch follows the figure.)
[Figure: the estimated coefficient function, s(T, 6.54):X, plotted against t.]
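A minimal sketch of setting up these matrices; Xfun, an assumed n x 200 matrix whose ith row holds the discretized x_i(t), and the time grid are illustrative.

  library(mgcv)
  tgrid <- seq(0, 1, length = 200)
  Tmat <- matrix(tgrid, nrow(Xfun), 200, byrow = TRUE)  # every row is a copy of t
  b <- gam(y ~ s(Tmat, by = Xfun))  # summation convention: the smooth of t is
                                    # summed against each row of Xfun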
Adaptive smoothing

- Perhaps I don't like this P-spline smooth of 'the' motorcycle crash data . . .
[Figure: the fit from gam(accel ~ s(times, bs = "ps", k = 40)), i.e. s(times, 10.29), to the motorcycle data.]
- Should I really use adaptive smoothing?
Adaptive smoothing 2
- P-splines and the preceding GAM framework make it very easy to do adaptive smoothing.
- Use a B-spline basis, f(x) = \sum_j \beta_j b_j(x), with an adaptive penalty

  P = \sum_{k=2}^{K-1} c_k (\beta_{k-1} - 2\beta_k + \beta_{k+1})^2,

  where c_k varies smoothly with k, and hence with x.
- Defining d_k = \beta_{k-1} - 2\beta_k + \beta_{k+1}, and D to be the matrix such that d = D\beta, we have P = \beta^T D^T \text{diag}(c) D \beta.
- Now use a (B-spline) basis expansion for c, so that c = C\lambda.
- Then P = \sum_j \lambda_j \beta^T D^T \text{diag}(C_{\cdot j}) D \beta.
- i.e. the adaptive P-spline is just a P-spline with multiple penalties (see the sketch below).
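A minimal sketch of constructing the multiple penalty matrices S_j = D^T diag(C_{.j}) D; the basis dimensions are illustrative.

  library(splines)
  K <- 40; J <- 5                               # spline coefficients; penalties
  D <- diff(diag(K), differences = 2)           # (K-2) x K second difference matrix
  C <- bs(1:(K - 2), df = J, intercept = TRUE)  # B-spline basis for c over k
  Slist <- lapply(1:J, function(j) t(D) %*% diag(C[, j]) %*% D)  # the S_j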
Adaptive smoothing 3
- The R package mgcv has an adaptive P-spline class. Using it does give some improvement . . .
[Figure: the fit from gam(accel ~ s(times, bs = "ad", k = 40)), i.e. s(times, 8.56), to the motorcycle data.]
Conclusions
- Penalized regression splines are the starting point for a fairly complete framework for generalized additive modelling.
- The numerical methods and theory developed for this framework are applicable to any quadratically penalized GLM, so many extensions of 'standard' GAMs are possible.
- The R package mgcv tries to exploit the generality of the framework, so that almost any quadratically penalized GLM can readily be used.
References
- Hastie and Tibshirani (1986) invented GAMs. The work of Wahba (e.g. 1990) and Gu (e.g. 2002) heavily influenced the work presented here. Duchon (1977) invented thin plate splines. The retinopathy data are from Gu.
- Penalized regression splines go back to Wahba (1980), but were given real impetus by Eilers and Marx (1996), and in a GAM context by Marx and Eilers (1998).
- See Wood (2006), Generalized Additive Models: An Introduction with R, CRC, for more information. Wood (2008, JRSSB) is more up to date on numerical methods.
- The mgcv package in R implements everything covered here.