Download pdf - Lecture 5: GPs and Streaming regression - GitHub PagesLecture 5: GPs and Streaming regression Gaussian Processes Information gain Con dence intervals COMP-652 and ECSE-608, Lecture

Lecture 5: GPs and Streaming regression

• Gaussian Processes

• Information gain

• Confidence intervals

COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1

Recall: Non-parametric regression

• Input space X ⊂ Rn, target space Y = R• Input matrix X of size m× n, target vector y of size m× 1

• Feature mapping φ : X 7→ Rd

• Assumption: y = Φw

• Minimize

Jλ(w) =1

2(Φw − y)>(Φw − y) +

λ

2w>w λ ≥ 0

• The solution is

w = (Φ>Φ + λIn)−1Φ>y = Φ>(ΦΦ> + λIm)−1y


Recall: Kernel regression

• Let K = ΦΦ> and k(x) = φ(x)>Φ> be the kernel matrix/vector:

K =

k(x1,x1) k(x1,x2) . . . k(x1,xm)k(x2,x1) k(x2,x2) . . . k(x2,xm)

... ... ...k(xm,x1) k(xm,x2) . . . k(xm,xm)

k(x) =

k(x,x1)k(x,x2)

...k(x,xm)

• The predictions for the input data are given by

y = ΦΦ>(ΦΦ> + λIm)−1y = K(K + λIm)−1y

• The prediction for a new input point x is given by

f(x) = φ(x)>Φ>(K + λIm)−1y = k(x)(K + λIm)−1y


Recall: Bayesian view of regression

• Consider noisy observations y = f(x) + ε = φ(x)>w + ε

• Recall Bayes’ rule: posterior = likelihood×priormarginal likelihood

Pφ(w|y,X) =Pφ(y|X,w)P (w)

Pφ(y|X)

⇒ Marginal likelihood is independent of weights w

• With Gaussian noise ε ∼ N (0, σ2)

Pφ(y|X,w) =

m∏i=1

Pφ(yi|xi,w) =

m∏i=1

1√2πσ

exp

(−(yi − φ(xi)

>w)2

2σ2

)

=1

(√

2πσ)mexp

(−‖y −Φw‖2

2σ2

)= Nm

(Φw, σ2Im

)COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 4

Posterior distribution on parameters

• With Gaussian prior on parameters w ∼ Nd(0,Σw)

Pφ(w|y,X) ∝ exp

(−‖y −Φw‖2

2σ2

)exp

(−w>Σ−1w w

2

)= exp

(−y>y − y>Φw −w>Φy + w>Φ>Φw + σ2w>Σ−1w w

2σ2

)= exp

(−y>y − y>Φw −w>Φy + w>(Φ>Φ + σ2Σ−1w )w

2σ2

)∝ exp

((w − b)>A−1(w − b)

)where A−1 = σ−2(Φ>Φ + σ2Σ−1w ) and b = (Φ>Φ + σ2Σ−1w )−1Φ>y

⇒ The posterior distribution is Gaussian!


Predictive distribution

• The pointwise posterior predictive distribution is a normal distribution

f(x)|x1, . . . ,xm, y1, . . . , ym ∼ N(f(x), s2(x)

)of expectation

f(x) = φ(x)>(Φ>Φ + σ2Σ−1w )−1Φ>y

= φ(x)>ΣwΦ>(ΦΣwΦ> + σ2Im)−1y

and variance

s2(x) = σ2φ(x)>(Φ>Φ + σ2Σ−1w )−1φ(x)

= φ(x)>Σwφ(x)− φ(x)>ΣwΦ>(Φ>ΣwΦ + σ2Im)−1ΦΣwφ(x)

→ using Sherman-Morrison


Reinterpreting regularization

• Recall kernel regression predictions:

f(x) = k(x)>(K + λIm)−1y

• Using prior Σw = σ2

λ Id, the predictive mean rewrites as:

f(x) = φ(x)>ΣwΦ>(ΦΣwΦ> + σ2Im)−1y

= φ(x)>σ2

λΦ>

(Φσ2

λΦ> + σ2Im

)−1y

= k(x)>(K + λIm)−1y

⇒ λ encodes some prior on weights w


Reinterpreting regularization (cont’d)

• Still using Σw = σ2

λ Id, the predictive variance rewrites as:

s2(x) = φ(x)>Σwφ(x)− φ(x)>ΣwΦ>(Φ>ΣwΦ + σ2Im)−1ΦΣwφ(x)

= φ(x)>σ2

λφ(x)− φ(x)>

σ2

λΦ>

(Φ>

σ2

λΦ + σ2Im

)−1Φσ2

λφ(x)

=σ2

λkλ(x,x) with

kλ(x,x′) = k(x,x′)− k(x)> (K + λIm)−1

k(x′)


Summary

• Using prior Σw = σ2

λ Id, we have

f(x) = k(x)>(K + λIm)−1y

s2(x) =σ2

λkλ(x,x) with

kλ(x,x′) = k(x,x′)− k(x)> (K + λIm)−1

k(x′)

• Providing pointwise posterior prediction

f(x)|x1, . . . ,xm, y1, . . . , ym ∼ N(f(x), s2(x)

)⇒ What does it mean to use λ ∈ R≥0?


Pointwise posterior distribution

• At each point x ∈ X , we have a distribution N(f(x), s2(x)

)• We can sample from these f(x) ∼ N

(f(x), s2(x)

)

−1.0 −0.5 0.0 0.5 1.0

X

−1.0

−0.5

0.0

0.5

1.0

Y

f s


Joint distribution

• Suppose you query your model at locations X∗• Extend the prior to include query points:[

ff∗

]|X,X∗ ∼ Nm+m∗

(0,

[KX,X KX,X∗KX∗,X KX∗,X∗

])y|f ∼ Nm(f , σ2Im)

• Using joint normality of f∗ and y:[yf∗

]∼ Nm+m∗

(0,

[KX,X + σ2Im KX,X∗

KX∗,X KX∗,X∗

])


Gaussian Process (GP)

• By considering the covariance between every points in the space, we geta distribution over functions!

• Posterior distribution on f :

P [f |X,y] ∼ N|X |([f(x)

]x∈X

,σ2

λ[kλ(x,x′)]x,x′∈X

)

−1.0 −0.5 0.0 0.5 1.0

X

−1.0

−0.5

0.0

0.5

1.0

Y

f s


Sampling from a Gaussian Process

• Generalization of normal probability distribution to the function space

– From a normal distribution we sample variables– From a GP we sample functions!

−1.0 −0.5 0.0 0.5 1.0

X

−1

0

1

2Y

Sampled f


Sampling from a Gaussian Process: How to

• Observe that

N|X |([f(x)

]x∈X

,σ2


)defines a |X |-dimensional multivariate Gaussian distribution

→ What if |X | =∞ (e.g. X = [−1, 1])?

• We can consider a discrete, finite, set X ⊂ X and sample from

N|X|([f(x)

]x∈X

,σ2


)

• This will result in a function f evaluated at every x ∈ X


Learning the hyperparameters

• If we assume that Σw = Id, then we have λ = σ2

• Let θ denote the kernel hyperparameters (e.g. ρ)

• In practice you may not know the noise and the optimal kernel hypers

• Recall: multivariate normal density

P (y|θ) =exp

(−1

2y>(Kθ + σ2Im)−1y

)√(2π)D|Kθ + σ2Im|

• Maximize the marginal likelihood L = logP (y|θ):

L = −1

2y>(Kθ + σ2Im)−1y − D

2log(2π)− 1

2log |Kθ + σ2Im|


Anatomy of marginal likelihood

• Marginal likelihood:

L = logP (y|θ) ∝ −1

2y>(Kθ + σ2Im)−1y − 1

2log |Kθ + σ2Im|

• 1st term: quality of predictions; 2nd term: model complexity• Trade-off (from Rasmussen & Williams, 2006):

C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006,ISBN 026218253X. c� 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml

5.4 Model Selection for GP Regression 113

100−100

−80

−60

−40

−20

0

20

40

log

prob

abilit

y

characteristic lengthscale

minus complexity penaltydata fitmarginal likelihood

100−100

−80

−60

−40

−20

0

20

Characteristic lengthscale

log

mar

gina

l lik

elih

ood

95% conf int

82155

(a) (b)

Figure 5.3: Panel (a) shows a decomposition of the log marginal likelihood intoits constituents: data-fit and complexity penalty, as a function of the characteristiclength-scale. The training data is drawn from a Gaussian process with SE covariancefunction and parameters (`,�f ,�n) = (1, 1, 0.1), the same as in Figure 2.5, and we arefitting only the length-scale parameter ` (the two other parameters have been set inaccordance with the generating process). Panel (b) shows the log marginal likelihoodas a function of the characteristic length-scale for di↵erent sizes of training sets. Alsoshown, are the 95% confidence intervals for the posterior length-scales.

and we re-state the result here

log p(y|X,✓) = �1

2y>K�1

y y � 1

2log |Ky|� n

2log 2⇡, (5.8)

where Ky = Kf +�2nI is the covariance matrix for the noisy targets y (and Kf

is the covariance matrix for the noise-free latent f), and we now explicitly writethe marginal likelihood conditioned on the hyperparameters (the parameters ofthe covariance function) ✓. From this perspective it becomes clear why we calleq. (5.8) the log marginal likelihood, since it is obtained through marginaliza- marginal likelihood

tion over the latent function. Otherwise, if one thinks entirely in terms of thefunction-space view, the term “marginal” may appear a bit mysterious, andsimilarly the “hyper” from the ✓ parameters of the covariance function.4

The three terms of the marginal likelihood in eq. (5.8) have readily inter- interpretation

pretable roles: the only term involving the observed targets is the data-fit�y>K�1

y y/2; log |Ky|/2 is the complexity penalty depending only on the co-variance function and the inputs and n log(2⇡)/2 is a normalization constant.In Figure 5.3(a) we illustrate this breakdown of the log marginal likelihood.The data-fit decreases monotonically with the length-scale, since the model be-comes less and less flexible. The negative complexity penalty increases with thelength-scale, because the model gets less complex with growing length-scale.The marginal likelihood itself peaks at a value close to 1. For length-scalessomewhat longer than 1, the marginal likelihood decreases rapidly (note the

4Another reason that we like to stick to the term “marginal likelihood” is that it is thelikelihood of a non-parametric model, i.e. a model which requires access to all the trainingdata when making predictions; this contrasts the situation for a parametric model, which“absorbs” the information from the training data into its (posterior) parameter (distribution).This di↵erence makes the two “likelihoods” behave quite di↵erently as a function of ✓.


Gradient-based optimization

• Compute gradients:

∂L∂θi

=1

2y>(Kθ + σ2Im)−1

∂(Kθ + σ2Im)

∂θi(Kθ + σ2Im)−1y

− 1

2Tr

((Kθ + σ2Im)−1

∂(Kθ + σ2Im)

∂θi

)

• Minimize the negative

• Non-convex optimization task


Summary

• Normal priors on the weights distribution → Gaussian Process

• Regularization → prior on the weights covariance

• GP provides a posterior distribution on functions

– Expectation: kernel regression model– Covariance → confidence intervals

• Sample discretized functions from a GP


Typical supervised setting

• Have dataset (X,y) of previously acquired data

• Learn model w on (X,y)

• Apply model w to provide predictions at new data points

• Samples (X,y) are often assumed to be i.i.d.


Streaming setting

• Start with (possibly empty) dataset (X0, y0)

• For each time step t = 1, 2, . . . :

– Fit model wt on Xt−1 and yt−1– Acquire a new sample (xt, yt) (possibly using wt)– Define Xt = [xi]i=1...t, yt = [yi]i=1...t

• Samples may be dependent of model (not i.i.d.)

→ Samples influence model / Samples depend on model


Application: Online function optimization

0.0 0.5 1.0

X

0.0

0.5

1.0

Y

?

• Unknown function f : X 7→ R• Sequentially select locations (xt)t≥1 at which to observe the function yt

• Try to maximize/minimize the observations


Example

What is the optimal treatment dosage for a given disease?

• For each time step t = 1, 2, . . . :

– New patient t comes in– We decide on treatment dosage xt– We observe the patient’s response to treatement, yt

• Our goal is to cure patients as effectively as possible

• What you don’t want: give bad dosages that were known to be bad

• What you want:

– give good dosages– try informative dosages


Information

• An informative sample improves the model by reducing its uncertainty

−1.0 −0.5 0.0 0.5 1.0

X

−1

0

1

Y

−1.0 −0.5 0.0 0.5 1.0

X

−1

0

1

YHow do we quantity the reduction of uncertaintyafter observing y1, . . . , yt at locations x1, . . . ,xt?


A little bit of information theory

• Mutual information between underlying function f and observationsy1, . . . , yt at locations x1, . . . ,xt:

I(y1, . . . , yt; f) = H(y1, . . . , yt)︸︷︷︸Marginal entropy

−H(y1, . . . , yt|f)︸︷︷︸Conditional entropy

• It quantifies the “amount of information” obtained about one randomvariable, through the other random variable

• Entropy is the “amount of information” held by a random variable

• H(Y |X) = 0 if and only if the value of Y is completely determined bythe value of X

• H(Y |X) = H(Y ) if and only if Y and X are independent randomvariables


Amount of information

H(X) = E[− lnPX] =

n∑i=1

p(xi) logb (p(xi))

Example: Tossing a coin

• A fair coin has maximal entropy: it is the less predictable!

• Every new sample that you get changes your model the most


Mutual information

• Entropy of X ∼ ND(µ,Σ):

H(X) = E

[− ln

exp(−1

2(x− µ)>Σ−1(x− µ))√

(2π)D|Σ|

]=D

2+D

2ln 2π+

1

2ln |Σ|

→ Using E[a>M−1a

]= E

[Tr(a>M−1a

)]= E

[Tr(aa>M−1)] = D

• Recall the joint distribution between m observations y and m∗ querypoints X∗: [

yf∗

]∼ Nm+m∗

(0,

[KX,X + σ2Im KX,X∗

KX∗,X KX∗,X∗

])

• Then I(y1, . . . , yt; f) = 12 ln |Kt + σ2It|


Another decomposition of mutual information

• Recall that y1, . . . , yt|f(x1), . . . , f(xt) ∼ Nt((f(x1), . . . , f(xt)) , σ

2It)

→ Using that yi = f(xi) + ε with ε ∼ N (0, σ2)

• Pluging-in the conditional entropy of a multivariate normal distribution:

I(y1, . . . , yt; f) = H(y1, . . . , yt)−1

2ln |2πeσ2It|

= H(y1, . . . , yt)−t

2ln 2πe− t

2σ2

• What is the marginal entropy H(y1, . . . , yt)?


Entropy of observations

• Recursively decompose

H(y1, . . . , yt) = H(y1, . . . , yt−1) +H(yt|y1, . . . , yt−1)

= H(y1, . . . , yt−1) +1

2ln

[2πe

(σ2 +

σ2

λkλ,t−1(xt, xt)

)]...

= H(y1) +H(y2|y1) + · · ·+ 1

2ln

[2πe

(σ2 +

σ2

λkλ,t−1(xs, xs)

)]

=

t∑s=1

1

2ln

[2πeσ2

(1 +

1

λkλ,s−1(xs, xs)

)]

→ Still using that yi = f(xi) + ε with ε ∼ N (0, σ2)→ Also using the uncertainty about the location of the true f


Information gain

I(y1, . . . , yt; f) =

t∑s=1

1

2ln

[2πeσ2

(1 +

1

λkλ,s−1(xs, xs)

)]− t

2ln(2πe)− t

2σ2

=

t∑s=1

1

2ln

[1 +

1

λkλ,s−1(xs, xs)

]

• Information gain γt(λ) = I(y1, . . . , yt; f): reduction of uncertainty on fafter observing y1, . . . , yt

• Information gain is inversely proportionnal to λ

→ Limiting changes in function limits the contribution of samples


Information gain vs regularization

−1.0 −0.5 0.0 0.5 1.0

X

-1

0

1Y

λ = 0.1

λ = 1

λ = 10

−1.0 −0.5 0.0 0.5 1.0

X

−1

0

1

Y

λ = 0.1

λ = 1

λ = 10

0 25 50 75 100

t

0

1

2

3

γt(λ

)

λ = 0.1

λ = 1

λ = 10

0 25 50 75 100

t

0

10

20

30

40

γt(λ

)

λ = 0.1

λ = 1

λ = 10


Summary

• Streaming regression: sequentially gather (potentially non-i.i.d) samples

• Information gain measures the total information that could result fromadding a new sample (observation) to the model

• Information gain is controlled by the information sharing capability ofthe kernel

• Information gain is controlled by the changes in model admitted byregularization

• Information gain will play a part in confidence intervals


Confidence intervals

• Given that we have gathered t samples under the streaming setting, whatkind of guarantees can we have on the resulting model?

• More specifically, could we guarantee that

|f(x)− ft(x)| ≤ something

simultaneously for all x ∈ X and for all t ≥ 0?

• Motivations:

– Control bad behaviours in critical applications– Help selecting the next sample location∗ Maximize/minimize function?∗ Maximize model improvement?

– Derive sound algorithms


Result (Maillard, 2016)

• Under the assumption of subgaussian noise...

|f(x)− ft(x)| ≤√kλ,t(x,x)

λ

[√λ‖f‖K + σ

√2 ln(1/δ) + 2γt(λ)

]• With probability higher than 1− δ• Simultaneously for all t ≥ 0, for all x ∈ X⇒ Observe that the error bound scales with the information gain!


Subgaussian noise

• A real-valued random variable X is σ2-subgaussian if E[eγX

]≤ eγ2σ2/2

→ The Laplace transform of X is dominated by the Laplace transform of arandom variable sampled from N (0, σ2)

• Require that the tails of the noise distribution are dominated by the tailsof a Gaussian distribution

• For example, true for

– Gaussian noise– Bounded noise


Confidence envelope

0.00 0.25 0.50 0.75 1.00

X

−1

0

1

Y5 observations

0.00 0.25 0.50 0.75 1.00

X

−1

0

1

Y

10 observations

0.00 0.25 0.50 0.75 1.00

X

−1

0

1

Y

50 observations

0.00 0.25 0.50 0.75 1.00

X

−1

0

1

Y

100 observations

f

λ = σ2

λ = σ2/‖f‖2K


Summary

• It is possible to have guarantees even with non-i.i.d. data

• The predition error depends on

– How well your model shares information across observations– How well your model is adapted to the function– How noisy the observations are– Regularization → prior on Σw

• Confidence intervals/envelopes will be really useful for deriving algorithms(we will see that later)