
Pattern Recognition and Machine Learning: Section 3.3

Description: Slides used for a reading-group presentation of Pattern Recognition and Machine Learning.

Page 1: Pattern Recognition and Machine Learning: Section 3.3

Reading

Pattern Recognition and Machine Learning §3.3 (Bayesian Linear Regression)

Christopher M. Bishop

Introduced by: Yusuke Oda (NAIST) @odashi_t


Page 2: Pattern Recognition and Machine Learning: Section 3.3

Agenda

3.3 Bayesian Linear Regression

– 3.3.1 Parameter distribution

– 3.3.2 Predictive distribution

– 3.3.3 Equivalent kernel


Page 3: Pattern Recognition and Machine Learning: Section 3.3

Agenda

3.3 Bayesian Linear Regression

– 3.3.1 Parameter distribution

– 3.3.2 Predictive distribution

– 3.3.3 Equivalent kernel


Page 4: Pattern Recognition and Machine Learning: Section 3.3

Bayesian Linear Regression

Maximum Likelihood (ML)

– The appropriate number of basis functions (≃ model complexity) depends on the size of the data set.

– Adding a regularization term lets us control the effective model complexity.

– But how should we determine the coefficient of the regularization term?


Page 5: Pattern Recognition and Machine Learning: Section 3.3

Bayesian Linear Regression

Maximum Likelihood (ML)

– Using ML to determine the coefficient of the regularization term ... a bad choice:

• this always leads to excessively complex models (= over-fitting).

– Using independent hold-out data to determine model complexity (see §1.3) ... computationally expensive and wasteful of valuable data.


In the case of the previous slide, λ always becomes 0 when using ML to determine λ.

Page 6: Pattern Recognition and Machine Learning: Section 3.3

Bayesian Linear Regression

Bayesian treatment of linear regression

– Avoids the over-fitting problem of ML.

– Leads to automatic methods of determining model complexity using the training data alone.

What do we do?

– Introduce a prior distribution p(w) and the likelihood p(t|w).

• Treat the model parameters w as a random variable rather than a point estimate.

– Compute the posterior distribution using Bayes' theorem:

p(\mathbf{w} \mid \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{w})\, p(\mathbf{w})


Page 7: Pattern Recognition and Machine Learning: Section 3.3

Agenda

3.3 Bayesian Linear Regression

– 3.3.1 Parameter distribution

– 3.3.2 Predictive distribution

– 3.3.3 Equivalent kernel


Page 8: Pattern Recognition and Machine Learning: Section 3.3

Note: Marginal / Conditional Gaussians

Given a marginal Gaussian distribution for x and a conditional Gaussian distribution for y given x:

p(\mathbf{x}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1}) \quad (2.113)

p(\mathbf{y} \mid \mathbf{x}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\mathbf{x} + \mathbf{b}, \mathbf{L}^{-1}) \quad (2.114)

then the marginal distribution of y and the conditional distribution of x given y are:

p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm{T}}) \quad (2.115)

p(\mathbf{x} \mid \mathbf{y}) = \mathcal{N}(\mathbf{x} \mid \boldsymbol{\Sigma}\{\mathbf{A}^{\mathrm{T}}\mathbf{L}(\mathbf{y} - \mathbf{b}) + \boldsymbol{\Lambda}\boldsymbol{\mu}\}, \boldsymbol{\Sigma}) \quad (2.116)

where \boldsymbol{\Sigma} = (\boldsymbol{\Lambda} + \mathbf{A}^{\mathrm{T}}\mathbf{L}\mathbf{A})^{-1} \quad (2.117)

Page 9: Pattern Recognition and Machine Learning: Section 3.3

Parameter Distribution

Remember the likelihood function given in §3.1.1, where the noise precision β is treated as a known parameter:

p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n), \beta^{-1}) \quad (3.10)

– This is the exponential of a quadratic function of w.

The corresponding conjugate prior is therefore a Gaussian distribution:

p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0) \quad (3.48)

Page 10: Pattern Recognition and Machine Learning: Section 3.3

Parameter Distribution

Given the Gaussian prior (3.48) and the likelihood (3.10), the posterior distribution follows from (2.116):

p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N) \quad (3.49)

where

\mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t}) \quad (3.50)

\mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi} \quad (3.51)

Page 11: Pattern Recognition and Machine Learning: Section 3.3

Online Learning - Parameter Distribution

If data points arrive sequentially, the posterior after the first n − 1 observations acts as the prior for the n-th observation, and the design matrix for a single new point has only one row: Φ = φ(x_n)^T.

Writing φ_n = φ(x_n) and t_n for the n-th input data, we obtain the formula for online learning:

\mathbf{m}_n = \mathbf{S}_n (\mathbf{S}_{n-1}^{-1}\mathbf{m}_{n-1} + \beta\boldsymbol{\phi}_n t_n)

where

\mathbf{S}_n^{-1} = \mathbf{S}_{n-1}^{-1} + \beta\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\mathrm{T}}

Page 12: Pattern Recognition and Machine Learning: Section 3.3

Easy Gaussian Prior - Parameter Distribution

If the prior distribution is a zero-mean isotropic Gaussian governed by a single precision parameter α:

p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) \quad (3.52)

the corresponding posterior distribution p(w|t) = N(w | m_N, S_N) is also given in closed form, where

\mathbf{m}_N = \beta\mathbf{S}_N\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t} \quad (3.53)

\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi} \quad (3.54)
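A minimal NumPy sketch of (3.53)–(3.54) under this isotropic prior; the function and variable names (posterior, Phi, alpha, beta) are illustrative choices, not taken from the slides:

import numpy as np

def posterior(Phi, t, alpha, beta):
    # Phi: (N, M) design matrix with Phi[n, j] = phi_j(x_n)
    # t: (N,) target vector; alpha: prior precision; beta: noise precision
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi   # (3.54)
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t                       # (3.53)
    return m_N, S_N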

Page 13: Pattern Recognition and Machine Learning: Section 3.3

Relationship with MSSE - Parameter Distribution

The log of the posterior distribution is the sum of the log likelihood and the log of the prior. If the prior is given by (3.52), the result is:

\ln p(\mathbf{w} \mid \mathbf{t}) = -\frac{\beta}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n)\}^2 - \frac{\alpha}{2}\mathbf{w}^{\mathrm{T}}\mathbf{w} + \mathrm{const} \quad (3.55)

The following two are therefore equivalent:

– Maximization of (3.55) with respect to w

– Minimization of the sum-of-squares error (MSSE) function with the addition of a quadratic regularization term, with regularization coefficient λ = α/β

Page 14: Pattern Recognition and Machine Learning: Section 3.3

Example - Parameter Distribution

Straight-line fitting

– Model function: y(x, w) = w_0 + w_1 x

– True function: f(x, a) = a_0 + a_1 x with a_0 = -0.3, a_1 = 0.5

– Error: Gaussian noise with standard deviation 0.2

– Goal: to recover the values of a_0 and a_1 from such data

– Prior distribution: the isotropic Gaussian (3.52)
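A minimal sketch of this example with sequential updating, assuming the values above (a_0 = -0.3, a_1 = 0.5, noise standard deviation 0.2, hence beta = 25) together with alpha = 2.0; the 20-point run and all names are illustrative only:

import numpy as np

rng = np.random.default_rng(0)
a0, a1, noise_std = -0.3, 0.5, 0.2        # assumed true parameters
alpha, beta = 2.0, 1.0 / noise_std**2     # assumed prior precision; noise precision

m = np.zeros(2)                           # prior mean m_0 = 0
S = np.eye(2) / alpha                     # prior covariance S_0 = alpha^{-1} I

for _ in range(20):                       # observe data points one at a time
    x = rng.uniform(-1.0, 1.0)
    t = a0 + a1 * x + rng.normal(0.0, noise_std)
    phi = np.array([1.0, x])              # phi(x) = (1, x)^T for y = w0 + w1 x
    S_inv = np.linalg.inv(S) + beta * np.outer(phi, phi)   # sequential form of (3.51)
    S_new = np.linalg.inv(S_inv)
    m = S_new @ (np.linalg.inv(S) @ m + beta * phi * t)    # sequential form of (3.50)
    S = S_new

print("posterior mean (w0, w1):", m)      # should move toward (-0.3, 0.5)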

Page 15: Pattern Recognition and Machine Learning: Section 3.3

Generalized Gaussian Prior - Parameter Distribution

We can generalize the Gaussian prior to an arbitrary exponent q (3.56).

The case q = 2 corresponds to the Gaussian, and only in this case is the prior conjugate to the likelihood (3.10).

Page 16: Pattern Recognition and Machine Learning: Section 3.3

Agenda

3.3 Bayesian Linear Regression

– 3.3.1 Parameter distribution

– 3.3.2 Predictive distribution

– 3.3.3 Equivalent kernel


Page 17: Pattern Recognition and Machine Learning: Section 3.3

Predictive Distribution

Let's consider making predictions of t directly for new values of x.

In order to do so, we need to evaluate the predictive distribution:

p(t \mid \mathbf{t}, \alpha, \beta) = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\, \mathrm{d}\mathbf{w} \quad (3.57)

This is a marginalization over w (summing w out).

Page 18: Pattern Recognition and Machine Learning: Section 3.3

Predictive Distribution

The conditional distribution of the target variable is given by:

p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}), \beta^{-1})

and the posterior weight distribution is given by:

p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)

Accordingly, the result of (3.57) follows from (2.115):

p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t \mid \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}), \sigma_N^2(\mathbf{x})) \quad (3.58)

where

\sigma_N^2(\mathbf{x}) = \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}) \quad (3.59)
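A minimal sketch of (3.58)–(3.59); predictive is an illustrative name, and m_N, S_N are assumed to come from a posterior computation such as the one sketched earlier:

import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    # phi_x: feature vector phi(x) of a new input; returns predictive mean and variance
    mean = m_N @ phi_x                           # m_N^T phi(x)
    var = 1.0 / beta + phi_x @ S_N @ phi_x       # (3.59): noise term + parameter uncertainty
    return mean, var

The first term of the variance is the irreducible observation noise, while the second reflects the remaining posterior uncertainty in w.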

Page 19: Pattern Recognition and Machine Learning: Section 3.3

Predictive Distribution

Now we discuss the variance (3.59) of the predictive distribution:

– The first term, 1/β, is additive noise governed by the parameter β; the second term depends on the basis vector φ(x) and, through S_N, on the mapped vectors φ(x_n) of each data point.

– As additional data points are observed, the posterior distribution becomes narrower: σ²_{N+1}(x) ≤ σ²_N(x).

– The second term of (3.59) goes to zero in the limit N → ∞, leaving only the noise variance 1/β.

Page 20: Pattern Recognition and Machine Learning: Section 3.3

Predictive Distribution


Page 21: Pattern Recognition and Machine Learning: Section 3.3

Example - Predictive Distribution

Gaussian regression with sine-curve data

– Basis functions: 9 Gaussian basis functions

[Figure: mean and standard deviation of the predictive distribution]
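A minimal sketch of how the design matrix for this example might be built, assuming 9 Gaussian basis functions with centres spread over [0, 1] and width s = 0.1; the data generation and all names are illustrative:

import numpy as np

def gaussian_design_matrix(x, centers, s=0.1):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), one column per centre
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * s ** 2))

x = np.linspace(0.0, 1.0, 25)                               # assumed input locations
t = np.sin(2 * np.pi * x) + 0.25 * np.random.randn(x.size)  # noisy sine-curve targets
Phi = gaussian_design_matrix(x, np.linspace(0.0, 1.0, 9))   # 9 Gaussian basis functions

Phi and t can then be plugged into the posterior and predictive sketches above to reproduce the mean and standard-deviation curves qualitatively.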

Page 22: Pattern Recognition and Machine Learning: Section 3.3

Example - Predictive Distribution

Gaussian regression with sine curve


Page 23: Pattern Recognition and Machine Learning: Section 3.3

Example - Predictive Distribution

Gaussian regression with sine curve


Page 24: Pattern Recognition and Machine Learning: Section 3.3

Problem of Localized Basis - Predictive Distribution

Polynomial regression

Gaussian regression


Which is better?

Page 25: Pattern Recognition and Machine Learning: Section 3.3

Problem of Localized Basis - Predictive Distribution

If we use localized basis functions such as Gaussians, then in regions away from the basis function centres the contribution from the second term in (3.59) goes to zero.

Accordingly, the predictive variance reduces to the noise contribution 1/β alone: the model becomes most confident exactly where it is extrapolating away from the data, which is not a good result. (Near the centres the second term makes a large contribution; away from them, only a small one.)

Page 26: Pattern Recognition and Machine Learning: Section 3.3

Problem of Localized Basis - Predictive Distribution

This problem (arising from choosing localized basis functions) can be avoided by adopting an alternative Bayesian approach to regression known as a Gaussian process.

– See §6.4.


Page 27: Pattern Recognition and Machine Learning: Section 3.3

Case of Unknown Precision - Predictive Distribution

If both w and β are treated as unknown, then we can introduce a conjugate prior distribution p(w, β), with the corresponding posterior distribution given by a Gaussian-gamma distribution (§2.3.6).

In that case the predictive distribution is a Student's t-distribution.

Page 28: Pattern Recognition and Machine Learning: Section 3.3

Agenda

3.3 Bayesian Linear Regression

– 3.3.1 Parameter distribution

– 3.3.2 Predictive distribution

– 3.3.3 Equivalent kernel


Page 29: Pattern Recognition and Machine Learning: Section 3.3

Equivalent Kernel

If we substitute the posterior mean solution (3.53) into the expression (3.3), the predictive mean can be written:

y(\mathbf{x}, \mathbf{m}_N) = \mathbf{m}_N^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}) = \beta\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t} = \sum_{n=1}^{N} \beta\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_n)\, t_n \quad (3.60)

This formula takes the form of a linear combination of the training targets t_n.

Page 30: Pattern Recognition and Machine Learning: Section 3.3

Equivalent Kernel

where the coefficient of each t_n is given by:

k(\mathbf{x}, \mathbf{x}') = \beta\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}') \quad (3.62)

This function is called the smoother matrix or the equivalent kernel.

Regression functions which make predictions by taking linear combinations of the training-set target values are known as linear smoothers.

We can also predict for a new input vector using the equivalent kernel directly, instead of calculating the parameters of the basis functions.
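A minimal sketch of the equivalent kernel (3.62) and of the linear-smoother view of the predictive mean; equivalent_kernel and the other names are illustrative:

import numpy as np

def equivalent_kernel(phi_x, Phi, S_N, beta):
    # k(x, x_n) = beta * phi(x)^T S_N phi(x_n), one weight per training target t_n
    return beta * (Phi @ (S_N @ phi_x))

# Linear-smoother prediction: y(x) = sum_n k(x, x_n) t_n,
# which coincides with m_N^T phi(x) because m_N = beta S_N Phi^T t:
#   k = equivalent_kernel(phi_x, Phi, S_N, beta); y_mean = k @ t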

Page 31: Pattern Recognition and Machine Learning: Section 3.3

Example 1 - Equivalent Kernel

Equivalent kernel with Gaussian regression

The equivalent kernel depends on the set of basis functions and on the data set.

Page 32: Pattern Recognition and Machine Learning: Section 3.3

Equivalent Kernel

The equivalent kernel expresses the contribution of each data point t_n to the predictive mean at x: points close to x make a large contribution, distant points only a small one.

The covariance between y(x) and y(x') can also be expressed with the equivalent kernel:

\mathrm{cov}[y(\mathbf{x}), y(\mathbf{x}')] = \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}') = \beta^{-1} k(\mathbf{x}, \mathbf{x}') \quad (3.63)

Page 33: Pattern Recognition and Machine Learning: Section 3.3

Properties of Equivalent Kernel - Equivalent Kernel

The equivalent kernel has a localization property even when the basis functions themselves are not localized (e.g. polynomial or sigmoidal basis functions).

The equivalent kernel sums to 1 for all x:

\sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n) = 1 \quad (3.64)

Page 34: Pattern Recognition and Machine Learning: Section 3.3

Example 2 - Equivalent Kernel

Equivalent kernel with polynomial regression

– Moving parameter:


Page 35: Pattern Recognition and Machine Learning: Section 3.3

Example 2 - Equivalent Kernel

Equivalent kernel with polynomial regression

– Moving parameter:


Page 36: Pattern Recognition and Machine Learning: Section 3.3

Example 2 - Equivalent Kernel

Equivalent kernel with polynomial regression

– Moving parameter:


Page 37: Pattern Recognition and Machine Learning: Section 3.3

Properties of Equivalent Kernel - Equivalent Kernel

The equivalent kernel satisfies an important property shared by kernel functions in general:

– A kernel function can be expressed in the form of an inner product with respect to a vector of nonlinear functions:

k(\mathbf{x}, \mathbf{z}) = \boldsymbol{\psi}(\mathbf{x})^{\mathrm{T}}\boldsymbol{\psi}(\mathbf{z}) \quad (3.65)

– In the case of the equivalent kernel, ψ(x) is given by \boldsymbol{\psi}(\mathbf{x}) = \beta^{1/2}\mathbf{S}_N^{1/2}\boldsymbol{\phi}(\mathbf{x}), so that \boldsymbol{\psi}(\mathbf{x})^{\mathrm{T}}\boldsymbol{\psi}(\mathbf{z}) = \beta\,\boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{z}) = k(\mathbf{x}, \mathbf{z}).

Page 38: Pattern Recognition and Machine Learning: Section 3.3

Thank you!


zzz...