Lecture 5: GPs and Streaming regression
• Gaussian Processes
• Information gain
• Confidence intervals
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 1
Recall: Non-parametric regression
• Input space X ⊂ Rn, target space Y = R• Input matrix X of size m× n, target vector y of size m× 1
• Feature mapping φ : X 7→ Rd
• Assumption: y = Φw
• Minimize
Jλ(w) =1
2(Φw − y)>(Φw − y) +
λ
2w>w λ ≥ 0
• The solution is
w = (Φ>Φ + λIn)−1Φ>y = Φ>(ΦΦ> + λIm)−1y
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 2
Recall: Kernel regression
• Let K = ΦΦ> and k(x) = φ(x)>Φ> be the kernel matrix/vector:
K =
k(x1,x1) k(x1,x2) . . . k(x1,xm)k(x2,x1) k(x2,x2) . . . k(x2,xm)
... ... ...k(xm,x1) k(xm,x2) . . . k(xm,xm)
k(x) =
k(x,x1)k(x,x2)
...k(x,xm)
• The predictions for the input data are given by
y = ΦΦ>(ΦΦ> + λIm)−1y = K(K + λIm)−1y
• The prediction for a new input point x is given by
f(x) = φ(x)>Φ>(K + λIm)−1y = k(x)(K + λIm)−1y
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 3
Recall: Bayesian view of regression
• Consider noisy observations y = f(x) + ε = φ(x)>w + ε
• Recall Bayes’ rule: posterior = likelihood×priormarginal likelihood
Pφ(w|y,X) =Pφ(y|X,w)P (w)
Pφ(y|X)
⇒ Marginal likelihood is independent of weights w
• With Gaussian noise ε ∼ N (0, σ2)
Pφ(y|X,w) =
m∏i=1
Pφ(yi|xi,w) =
m∏i=1
1√2πσ
exp
(−(yi − φ(xi)
>w)2
2σ2
)
=1
(√
2πσ)mexp
(−‖y −Φw‖2
2σ2
)= Nm
(Φw, σ2Im
)COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 4
Posterior distribution on parameters
• With Gaussian prior on parameters w ∼ Nd(0,Σw)
Pφ(w|y,X) ∝ exp
(−‖y −Φw‖2
2σ2
)exp
(−w>Σ−1w w
2
)= exp
(−y>y − y>Φw −w>Φy + w>Φ>Φw + σ2w>Σ−1w w
2σ2
)= exp
(−y>y − y>Φw −w>Φy + w>(Φ>Φ + σ2Σ−1w )w
2σ2
)∝ exp
((w − b)>A−1(w − b)
)where A−1 = σ−2(Φ>Φ + σ2Σ−1w ) and b = (Φ>Φ + σ2Σ−1w )−1Φ>y
⇒ The posterior distribution is Gaussian!
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 5
Predictive distribution
• The pointwise posterior predictive distribution is a normal distribution
f(x)|x1, . . . ,xm, y1, . . . , ym ∼ N(f(x), s2(x)
)of expectation
f(x) = φ(x)>(Φ>Φ + σ2Σ−1w )−1Φ>y
= φ(x)>ΣwΦ>(ΦΣwΦ> + σ2Im)−1y
and variance
s2(x) = σ2φ(x)>(Φ>Φ + σ2Σ−1w )−1φ(x)
= φ(x)>Σwφ(x)− φ(x)>ΣwΦ>(Φ>ΣwΦ + σ2Im)−1ΦΣwφ(x)
→ using Sherman-Morrison
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 6
Reinterpreting regularization
• Recall kernel regression predictions:
f(x) = k(x)>(K + λIm)−1y
• Using prior Σw = σ2
λ Id, the predictive mean rewrites as:
f(x) = φ(x)>ΣwΦ>(ΦΣwΦ> + σ2Im)−1y
= φ(x)>σ2
λΦ>
(Φσ2
λΦ> + σ2Im
)−1y
= k(x)>(K + λIm)−1y
⇒ λ encodes some prior on weights w
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 7
Reinterpreting regularization (cont’d)
• Still using Σw = σ2
λ Id, the predictive variance rewrites as:
s2(x) = φ(x)>Σwφ(x)− φ(x)>ΣwΦ>(Φ>ΣwΦ + σ2Im)−1ΦΣwφ(x)
= φ(x)>σ2
λφ(x)− φ(x)>
σ2
λΦ>
(Φ>
σ2
λΦ + σ2Im
)−1Φσ2
λφ(x)
=σ2
λkλ(x,x) with
kλ(x,x′) = k(x,x′)− k(x)> (K + λIm)−1
k(x′)
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 8
Summary
• Using prior Σw = σ2
λ Id, we have
f(x) = k(x)>(K + λIm)−1y
s2(x) =σ2
λkλ(x,x) with
kλ(x,x′) = k(x,x′)− k(x)> (K + λIm)−1
k(x′)
• Providing pointwise posterior prediction
f(x)|x1, . . . ,xm, y1, . . . , ym ∼ N(f(x), s2(x)
)⇒ What does it mean to use λ ∈ R≥0?
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 9
Pointwise posterior distribution
• At each point x ∈ X , we have a distribution N(f(x), s2(x)
)• We can sample from these f(x) ∼ N
(f(x), s2(x)
)
−1.0 −0.5 0.0 0.5 1.0
X
−1.0
−0.5
0.0
0.5
1.0
Y
f s
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 10
Joint distribution
• Suppose you query your model at locations X∗• Extend the prior to include query points:[
ff∗
]|X,X∗ ∼ Nm+m∗
(0,
[KX,X KX,X∗KX∗,X KX∗,X∗
])y|f ∼ Nm(f , σ2Im)
• Using joint normality of f∗ and y:[yf∗
]∼ Nm+m∗
(0,
[KX,X + σ2Im KX,X∗
KX∗,X KX∗,X∗
])
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 11
Gaussian Process (GP)
• By considering the covariance between every points in the space, we geta distribution over functions!
• Posterior distribution on f :
P [f |X,y] ∼ N|X |([f(x)
]x∈X
,σ2
λ[kλ(x,x′)]x,x′∈X
)
−1.0 −0.5 0.0 0.5 1.0
X
−1.0
−0.5
0.0
0.5
1.0
Y
f s
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 12
Sampling from a Gaussian Process
• Generalization of normal probability distribution to the function space
– From a normal distribution we sample variables– From a GP we sample functions!
−1.0 −0.5 0.0 0.5 1.0
X
−1
0
1
2Y
Sampled f
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 13
Sampling from a Gaussian Process: How to
• Observe that
N|X |([f(x)
]x∈X
,σ2
λ[kλ(x,x′)]x,x′∈X
)defines a |X |-dimensional multivariate Gaussian distribution
→ What if |X | =∞ (e.g. X = [−1, 1])?
• We can consider a discrete, finite, set X ⊂ X and sample from
N|X|([f(x)
]x∈X
,σ2
λ[kλ(x,x′)]x,x′∈X
)
• This will result in a function f evaluated at every x ∈ X
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 14
Learning the hyperparameters
• If we assume that Σw = Id, then we have λ = σ2
• Let θ denote the kernel hyperparameters (e.g. ρ)
• In practice you may not know the noise and the optimal kernel hypers
• Recall: multivariate normal density
P (y|θ) =exp
(−1
2y>(Kθ + σ2Im)−1y
)√(2π)D|Kθ + σ2Im|
• Maximize the marginal likelihood L = logP (y|θ):
L = −1
2y>(Kθ + σ2Im)−1y − D
2log(2π)− 1
2log |Kθ + σ2Im|
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 15
Anatomy of marginal likelihood
• Marginal likelihood:
L = logP (y|θ) ∝ −1
2y>(Kθ + σ2Im)−1y − 1
2log |Kθ + σ2Im|
• 1st term: quality of predictions; 2nd term: model complexity• Trade-off (from Rasmussen & Williams, 2006):
C. E. Rasmussen & C. K. I. Williams, Gaussian Processes for Machine Learning, the MIT Press, 2006,ISBN 026218253X. c� 2006 Massachusetts Institute of Technology. www.GaussianProcess.org/gpml
5.4 Model Selection for GP Regression 113
100−100
−80
−60
−40
−20
0
20
40
log
prob
abilit
y
characteristic lengthscale
minus complexity penaltydata fitmarginal likelihood
100−100
−80
−60
−40
−20
0
20
Characteristic lengthscale
log
mar
gina
l lik
elih
ood
95% conf int
82155
(a) (b)
Figure 5.3: Panel (a) shows a decomposition of the log marginal likelihood intoits constituents: data-fit and complexity penalty, as a function of the characteristiclength-scale. The training data is drawn from a Gaussian process with SE covariancefunction and parameters (`,�f ,�n) = (1, 1, 0.1), the same as in Figure 2.5, and we arefitting only the length-scale parameter ` (the two other parameters have been set inaccordance with the generating process). Panel (b) shows the log marginal likelihoodas a function of the characteristic length-scale for di↵erent sizes of training sets. Alsoshown, are the 95% confidence intervals for the posterior length-scales.
and we re-state the result here
log p(y|X,✓) = �1
2y>K�1
y y � 1
2log |Ky|� n
2log 2⇡, (5.8)
where Ky = Kf +�2nI is the covariance matrix for the noisy targets y (and Kf
is the covariance matrix for the noise-free latent f), and we now explicitly writethe marginal likelihood conditioned on the hyperparameters (the parameters ofthe covariance function) ✓. From this perspective it becomes clear why we calleq. (5.8) the log marginal likelihood, since it is obtained through marginaliza- marginal likelihood
tion over the latent function. Otherwise, if one thinks entirely in terms of thefunction-space view, the term “marginal” may appear a bit mysterious, andsimilarly the “hyper” from the ✓ parameters of the covariance function.4
The three terms of the marginal likelihood in eq. (5.8) have readily inter- interpretation
pretable roles: the only term involving the observed targets is the data-fit�y>K�1
y y/2; log |Ky|/2 is the complexity penalty depending only on the co-variance function and the inputs and n log(2⇡)/2 is a normalization constant.In Figure 5.3(a) we illustrate this breakdown of the log marginal likelihood.The data-fit decreases monotonically with the length-scale, since the model be-comes less and less flexible. The negative complexity penalty increases with thelength-scale, because the model gets less complex with growing length-scale.The marginal likelihood itself peaks at a value close to 1. For length-scalessomewhat longer than 1, the marginal likelihood decreases rapidly (note the
4Another reason that we like to stick to the term “marginal likelihood” is that it is thelikelihood of a non-parametric model, i.e. a model which requires access to all the trainingdata when making predictions; this contrasts the situation for a parametric model, which“absorbs” the information from the training data into its (posterior) parameter (distribution).This di↵erence makes the two “likelihoods” behave quite di↵erently as a function of ✓.
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 16
Gradient-based optimization
• Compute gradients:
∂L∂θi
=1
2y>(Kθ + σ2Im)−1
∂(Kθ + σ2Im)
∂θi(Kθ + σ2Im)−1y
− 1
2Tr
((Kθ + σ2Im)−1
∂(Kθ + σ2Im)
∂θi
)
• Minimize the negative
• Non-convex optimization task
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 17
Summary
• Normal priors on the weights distribution → Gaussian Process
• Regularization → prior on the weights covariance
• GP provides a posterior distribution on functions
– Expectation: kernel regression model– Covariance → confidence intervals
• Sample discretized functions from a GP
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 18
Typical supervised setting
• Have dataset (X,y) of previously acquired data
• Learn model w on (X,y)
• Apply model w to provide predictions at new data points
• Samples (X,y) are often assumed to be i.i.d.
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 19
Streaming setting
• Start with (possibly empty) dataset (X0, y0)
• For each time step t = 1, 2, . . . :
– Fit model wt on Xt−1 and yt−1– Acquire a new sample (xt, yt) (possibly using wt)– Define Xt = [xi]i=1...t, yt = [yi]i=1...t
• Samples may be dependent of model (not i.i.d.)
→ Samples influence model / Samples depend on model
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 20
Application: Online function optimization
0.0 0.5 1.0
X
0.0
0.5
1.0
Y
?
• Unknown function f : X 7→ R• Sequentially select locations (xt)t≥1 at which to observe the function yt
• Try to maximize/minimize the observations
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 21
Example
What is the optimal treatment dosage for a given disease?
• For each time step t = 1, 2, . . . :
– New patient t comes in– We decide on treatment dosage xt– We observe the patient’s response to treatement, yt
• Our goal is to cure patients as effectively as possible
• What you don’t want: give bad dosages that were known to be bad
• What you want:
– give good dosages– try informative dosages
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 22
Information
• An informative sample improves the model by reducing its uncertainty
−1.0 −0.5 0.0 0.5 1.0
X
−1
0
1
Y
−1.0 −0.5 0.0 0.5 1.0
X
−1
0
1
YHow do we quantity the reduction of uncertaintyafter observing y1, . . . , yt at locations x1, . . . ,xt?
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 23
A little bit of information theory
• Mutual information between underlying function f and observationsy1, . . . , yt at locations x1, . . . ,xt:
I(y1, . . . , yt; f) = H(y1, . . . , yt)︸ ︷︷ ︸Marginal entropy
−H(y1, . . . , yt|f)︸ ︷︷ ︸Conditional entropy
• It quantifies the “amount of information” obtained about one randomvariable, through the other random variable
• Entropy is the “amount of information” held by a random variable
• H(Y |X) = 0 if and only if the value of Y is completely determined bythe value of X
• H(Y |X) = H(Y ) if and only if Y and X are independent randomvariables
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 24
Amount of information
H(X) = E[− lnPX] =
n∑i=1
p(xi) logb (p(xi))
Example: Tossing a coin
• A fair coin has maximal entropy: it is the less predictable!
• Every new sample that you get changes your model the most
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 25
Mutual information
• Entropy of X ∼ ND(µ,Σ):
H(X) = E
[− ln
exp(−1
2(x− µ)>Σ−1(x− µ))√
(2π)D|Σ|
]=D
2+D
2ln 2π+
1
2ln |Σ|
→ Using E[a>M−1a
]= E
[Tr(a>M−1a
)]= E
[Tr(aa>M−1)] = D
• Recall the joint distribution between m observations y and m∗ querypoints X∗: [
yf∗
]∼ Nm+m∗
(0,
[KX,X + σ2Im KX,X∗
KX∗,X KX∗,X∗
])
• Then I(y1, . . . , yt; f) = 12 ln |Kt + σ2It|
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 26
Another decomposition of mutual information
• Recall that y1, . . . , yt|f(x1), . . . , f(xt) ∼ Nt((f(x1), . . . , f(xt)) , σ
2It)
→ Using that yi = f(xi) + ε with ε ∼ N (0, σ2)
• Pluging-in the conditional entropy of a multivariate normal distribution:
I(y1, . . . , yt; f) = H(y1, . . . , yt)−1
2ln |2πeσ2It|
= H(y1, . . . , yt)−t
2ln 2πe− t
2σ2
• What is the marginal entropy H(y1, . . . , yt)?
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 27
Entropy of observations
• Recursively decompose
H(y1, . . . , yt) = H(y1, . . . , yt−1) +H(yt|y1, . . . , yt−1)
= H(y1, . . . , yt−1) +1
2ln
[2πe
(σ2 +
σ2
λkλ,t−1(xt, xt)
)]...
= H(y1) +H(y2|y1) + · · ·+ 1
2ln
[2πe
(σ2 +
σ2
λkλ,t−1(xs, xs)
)]
=
t∑s=1
1
2ln
[2πeσ2
(1 +
1
λkλ,s−1(xs, xs)
)]
→ Still using that yi = f(xi) + ε with ε ∼ N (0, σ2)→ Also using the uncertainty about the location of the true f
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 28
Information gain
I(y1, . . . , yt; f) =
t∑s=1
1
2ln
[2πeσ2
(1 +
1
λkλ,s−1(xs, xs)
)]− t
2ln(2πe)− t
2σ2
=
t∑s=1
1
2ln
[1 +
1
λkλ,s−1(xs, xs)
]
• Information gain γt(λ) = I(y1, . . . , yt; f): reduction of uncertainty on fafter observing y1, . . . , yt
• Information gain is inversely proportionnal to λ
→ Limiting changes in function limits the contribution of samples
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 29
Information gain vs regularization
−1.0 −0.5 0.0 0.5 1.0
X
-1
0
1Y
λ = 0.1
λ = 1
λ = 10
−1.0 −0.5 0.0 0.5 1.0
X
−1
0
1
Y
λ = 0.1
λ = 1
λ = 10
0 25 50 75 100
t
0
1
2
3
γt(λ
)
λ = 0.1
λ = 1
λ = 10
0 25 50 75 100
t
0
10
20
30
40
γt(λ
)
λ = 0.1
λ = 1
λ = 10
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 30
Summary
• Streaming regression: sequentially gather (potentially non-i.i.d) samples
• Information gain measures the total information that could result fromadding a new sample (observation) to the model
• Information gain is controlled by the information sharing capability ofthe kernel
• Information gain is controlled by the changes in model admitted byregularization
• Information gain will play a part in confidence intervals
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 31
Confidence intervals
• Given that we have gathered t samples under the streaming setting, whatkind of guarantees can we have on the resulting model?
• More specifically, could we guarantee that
|f(x)− ft(x)| ≤ something
simultaneously for all x ∈ X and for all t ≥ 0?
• Motivations:
– Control bad behaviours in critical applications– Help selecting the next sample location∗ Maximize/minimize function?∗ Maximize model improvement?
– Derive sound algorithms
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 32
Result (Maillard, 2016)
• Under the assumption of subgaussian noise...
|f(x)− ft(x)| ≤√kλ,t(x,x)
λ
[√λ‖f‖K + σ
√2 ln(1/δ) + 2γt(λ)
]• With probability higher than 1− δ• Simultaneously for all t ≥ 0, for all x ∈ X⇒ Observe that the error bound scales with the information gain!
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 33
Subgaussian noise
• A real-valued random variable X is σ2-subgaussian if E[eγX
]≤ eγ2σ2/2
→ The Laplace transform of X is dominated by the Laplace transform of arandom variable sampled from N (0, σ2)
• Require that the tails of the noise distribution are dominated by the tailsof a Gaussian distribution
• For example, true for
– Gaussian noise– Bounded noise
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 34
Confidence envelope
0.00 0.25 0.50 0.75 1.00
X
−1
0
1
Y5 observations
0.00 0.25 0.50 0.75 1.00
X
−1
0
1
Y
10 observations
0.00 0.25 0.50 0.75 1.00
X
−1
0
1
Y
50 observations
0.00 0.25 0.50 0.75 1.00
X
−1
0
1
Y
100 observations
f
λ = σ2
λ = σ2/‖f‖2K
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 35
Summary
• It is possible to have guarantees even with non-i.i.d. data
• The predition error depends on
– How well your model shares information across observations– How well your model is adapted to the function– How noisy the observations are– Regularization → prior on Σw
• Confidence intervals/envelopes will be really useful for deriving algorithms(we will see that later)
COMP-652 and ECSE-608, Lecture 5 - September 19, 2017 36