Murphy's Machine Learning: Gaussian Processes


Gaussian Process

Jungkyu Lee

Daum Search Quality Team

Intuition

• http://explainaway.wordpress.com/2008/12/01/ridiculous-stats/

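• x-axis: year; y-axis: 100 m record; shaded band: uncertainty of the prediction

• This matches our intuition: values after 2030 cannot be predicted from the earlier data

• GP regression is the regression to use when inputs x (years) that are close together are expected to have similar function values (100 m records)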

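Prerequisite: Conditioning of Multivariate Gaussians

• Let $x \in \mathbb{R}^n$ be a random vector with $x \sim \mathcal{N}(\mu, \Sigma)$, partitioned as $x = (x_1, x_2)$ with $\mu = (\mu_1, \mu_2)$ and $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$

• Then the marginals are given by
  $p(x_1) = \mathcal{N}(x_1 \mid \mu_1, \Sigma_{11})$, $p(x_2) = \mathcal{N}(x_2 \mid \mu_2, \Sigma_{22})$

• and the posterior conditional is given by
  $p(x_1 \mid x_2) = \mathcal{N}(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$, with $\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$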
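As a small illustration of the conditioning rule for a partitioned Gaussian, the sketch below computes the mean and covariance of p(x1 | x2) with NumPy. The function name `condition_gaussian` and the 2-D example values are assumptions made for illustration; they do not come from the slides.

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx1, idx2, x2):
    """Mean and covariance of p(x1 | x2) for a jointly Gaussian (x1, x2)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    # gain = S12 @ inv(S22), computed with a linear solve (S22 is symmetric)
    gain = np.linalg.solve(S22, S12.T).T
    cond_mean = mu1 + gain @ (x2 - mu2)
    cond_cov = S11 - gain @ S12.T
    return cond_mean, cond_cov

# Hypothetical example: two unit-variance variables with correlation 0.8.
mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 1.0]])
m, C = condition_gaussian(mu, Sigma, [0], [1], np.array([1.0]))
```

Observing x2 = 1 pulls the mean of x1 toward 0.8 and shrinks its variance from 1 to 0.36; GP prediction applies exactly this computation to function values.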

Prerequisite: Bayesian Linear Regression

• Wiki link: http://play.daumcorp.com/display/~sweaterr/7.+Linear+Regression#7.LinearRegression-7.6Bayesianlinearregression

• If the prior on w is Gaussian and the likelihood is Gaussian, then the posterior of w is also Gaussian and can be computed with the linear Gaussian system formulas

• The posterior predictive can also be computed because its integral is tractable; it is a normal distribution whose mean is the prediction made with the posterior mean of the weights
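The two statements about Bayesian linear regression can be sketched as follows, assuming a prior N(w | w0, V0) and a known noise variance sigma2. The function names and the toy data are hypothetical; the formulas are the standard linear Gaussian system results.

```python
import numpy as np

def blr_posterior(X, y, sigma2, w0, V0):
    """Gaussian posterior N(w | wN, VN) for a Gaussian prior and Gaussian likelihood."""
    V0_inv = np.linalg.inv(V0)
    VN = np.linalg.inv(V0_inv + X.T @ X / sigma2)
    wN = VN @ (V0_inv @ w0 + X.T @ y / sigma2)
    return wN, VN

def blr_predict(x_new, wN, VN, sigma2):
    """Posterior predictive N(y | wN^T x, sigma2 + x^T VN x) for one input x_new."""
    mean = x_new @ wN
    var = sigma2 + x_new @ VN @ x_new
    return mean, var

# Toy data generated by y = 2x, with a very broad prior on the single weight.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
wN, VN = blr_posterior(X, y, sigma2=0.1, w0=np.zeros(1), V0=1e6 * np.eye(1))
mean, var = blr_predict(np.array([4.0]), wN, VN, sigma2=0.1)
```

The predictive mean is the prediction made with the posterior mean wN, and the predictive variance adds the weight uncertainty x^T VN x on top of the observation noise.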

Parametric Models vs. Non-parametric Models

Unlike a parametric model, a non-parametric model does not assume a structure for the model and instead learns everything from the data.

Parametric models:
• Linear Regression
• GMM

Non-parametric models:
• KNN
• Kernel Regression
• Gaussian Process

Non-parametric Models

• Kernel Regression (non-parametric, non-Bayesian)

• GP Regression (non-parametric, Bayesian): because it infers a distribution over functions, it also yields a confidence for each prediction

15.1 Introduction

• In supervised learning, we observe some inputs xi and some outputs yi.

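• We assumed that there is some function f that maps xi to yi, and we tried to recover that function.

• The ideal approach is to infer a distribution over that function and use it for prediction.

• So far we have focused on parametric representations of the function, estimating p(θ|D) instead of p(f|D).

• Now we perform Bayesian inference over the function itself.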

Probability distributions over functions with finite domains

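• Suppose we have the following training inputs: $\mathcal{X} = \{x_1, \ldots, x_m\}$

• A single function in the space of all functions on $\mathcal{X}$ assigns one real value to each input: $h : \mathcal{X} \to \mathbb{R}$

• Because the domain is a finite set of m elements, the function can be written as a vector: $\mathbf{h} = [h(x_1), h(x_2), \ldots, h(x_m)]^T$

• How can we express a distribution over functions? One natural choice is $\mathbf{h} \sim \mathcal{N}(\boldsymbol{\mu}, \sigma^2 I)$

• That is, the function values on the finite domain follow a normal distribution.

• Because the covariance is diagonal, h(x1) and h(x2) are independent

• Then how do we express a distribution over functions when the domain is infinite?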

Probability distributions over functions with infinite domains

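• A stochastic process is a collection of random variables

• A Gaussian process is a stochastic process in which the joint distribution of any finite sub-collection of these random variables is a multivariate Gaussian (MVN)

• A Gaussian process is a distribution over infinitely many dimensions, but the distribution of any finite subset of them is an MVN

• For a finite set x1, ..., xm, the random variables h(x1), ..., h(xm) have the following distribution:
  $[h(x_1), \ldots, h(x_m)]^T \sim \mathcal{N}(\mathbf{m}, K)$, where $\mathbf{m} = [m(x_1), \ldots, m(x_m)]^T$ and $K_{ij} = \kappa(x_i, x_j)$

• For an infinite domain we write: $h(\cdot) \sim \mathcal{GP}(m(\cdot), \kappa(\cdot, \cdot))$

• Unlike the finite-domain case, the covariance is filled in by a kernel.

• The distribution over functions is Gaussian with mean function m, and the kernel defines the correlation between function values so that it is strong when the inputs are close.

• The key idea is that if inputs are similar, we expect the function values at those points to be similar as well

• A Gaussian process is defined over infinitely many dimensions (an infinite domain), but in practice we can only work with finite subsets

• Fortunately, working with finite subsets gives the same inferences as if the whole infinite domain had been considered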

Graphical Model for GP

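• A Gaussian process with 2 training points and 1 test point

• A mixed directed and undirected graphical model representing $p(\mathbf{y}, \mathbf{f} \mid \mathbf{x}) = \mathcal{N}(\mathbf{f} \mid \mathbf{0}, K(\mathbf{x})) \prod_i p(y_i \mid f_i)$

• fi = f(xi) is the function value at each data point and is a hidden node

• The more similar the test point x* is to the training points, the more similar y* is to the training outputs.

• So, as in the kernel regression of the previous chapter, the prediction at a test point is close to the f values of the nearby training points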

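Different kernels

• Because a GP is a distribution over functions, we can sample functions from it

• The shape of the sampled functions depends on the kernel.

• Designing the GP through its kernel is the single biggest factor in how well it predicts on a given problem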
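Sampling functions from a GP prior can be sketched as below: evaluate the kernel on a finite grid, then draw from the resulting multivariate Gaussian. The squared-exponential kernel, the two length scales, the grid, and the jitter value are all assumptions chosen for illustration.

```python
import numpy as np

def se_kernel(xa, xb, ell=1.0):
    """Squared-exponential kernel for 1-D inputs (an illustrative choice)."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def sample_gp_prior(x, kernel, n_samples, rng, jitter=1e-6):
    """Draw functions from a zero-mean GP prior, evaluated on the grid x."""
    K = kernel(x, x) + jitter * np.eye(len(x))   # jitter keeps the Cholesky stable
    L = np.linalg.cholesky(K)
    z = rng.standard_normal((len(x), n_samples))
    return (L @ z).T                             # each row is one sampled function

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 50)
smooth = sample_gp_prior(x, lambda a, b: se_kernel(a, b, ell=2.0), 3, rng)
wiggly = sample_gp_prior(x, lambda a, b: se_kernel(a, b, ell=0.3), 3, rng)
```

With the longer length scale the sampled functions change slowly between neighbouring grid points; with the shorter one they oscillate much more, which is the sense in which the kernel determines the shape of the functions.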

15.2 GPs for regression

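• Suppose the function follows a GP prior: $f(x) \sim \mathcal{GP}(m(x), \kappa(x, x'))$

• where m(x) is the mean function and κ(x, x′) is the kernel or covariance function

• For any finite set of data points, this process defines a joint Gaussian (the defining property of a GP):
  $p(\mathbf{f} \mid X) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, K)$, where $K_{ij} = \kappa(x_i, x_j)$ and $\boldsymbol{\mu} = (m(x_1), \ldots, m(x_N))$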

15.2.1 Predictions using noise-free observations

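• fi = f(xi) is the noise-free observation of the function evaluated at xi

• Given a test set X∗ of size N∗ × D, we want to predict the function outputs f∗ at the test inputs

• Because the observations are assumed to be noise-free, the GP reproduces f(x) at an observed x with no uncertainty. The predictions f∗ at the test points are appended to the vector f, which gives the joint distribution
  $\begin{pmatrix} \mathbf{f} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{pmatrix}, \begin{pmatrix} K & K_* \\ K_*^T & K_{**} \end{pmatrix} \right)$

• $K = \kappa(X, X)$ (N × N): Gram matrix between the training points

• $K_* = \kappa(X, X_*)$ (N × N∗): Gram matrix between the training points and the test points

• $K_{**} = \kappa(X_*, X_*)$ (N∗ × N∗): Gram matrix between the test points

• By the standard rules for conditioning Gaussians (Section 4.3), the posterior has the following form
  $p(\mathbf{f}_* \mid X_*, X, \mathbf{f}) = \mathcal{N}(\mathbf{f}_* \mid \boldsymbol{\mu}_*, \Sigma_*)$, with $\boldsymbol{\mu}_* = \boldsymbol{\mu}(X_*) + K_*^T K^{-1} (\mathbf{f} - \boldsymbol{\mu}(X))$ and $\Sigma_* = K_{**} - K_*^T K^{-1} K_*$

• If the mean is 0 and K is the identity matrix, the posterior mean of f∗ is $K_*^T \mathbf{f}$; that is, training points that are more similar to the test point get more weight on their f values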
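A minimal sketch of noise-free GP prediction, built from the three Gram matrices (training/training, training/test, test/test) and the Gaussian conditioning rule. It assumes a zero mean function and a squared-exponential kernel; the function names, the data, and the small jitter term are illustrative assumptions.

```python
import numpy as np

def se_kernel(xa, xb, ell=1.0):
    """Squared-exponential kernel for 1-D inputs (an illustrative choice)."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_predict_noise_free(x_train, f_train, x_test, ell=1.0, jitter=1e-10):
    """Posterior mean and covariance of f* given noise-free values f_train."""
    K = se_kernel(x_train, x_train, ell) + jitter * np.eye(len(x_train))
    K_s = se_kernel(x_train, x_test, ell)      # N x N*, training vs. test
    K_ss = se_kernel(x_test, x_test, ell)      # N* x N*, test vs. test
    mean = K_s.T @ np.linalg.solve(K, f_train)
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov

# Five noise-free observations of a hypothetical target function.
x_train = np.array([-4.0, -2.0, 0.0, 1.0, 3.0])
f_train = np.sin(x_train)
x_test = np.array([-2.0, 0.5, 8.0])   # an observed point, a nearby point, a far point
mean, cov = gp_predict_noise_free(x_train, f_train, x_test)
```

At the observed input the posterior reproduces the training value with essentially zero variance, and the variance at the far test point is larger than at the nearby one.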


• Left: the GP prior

• Right: the GP posterior after seeing 5 noise-free observations

• The predictive uncertainty grows with distance from the observations

15.2.2 Predictions using noisy observations

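• Now suppose the observations are noisy: $y = f(x) + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma_y^2)$

• Unlike the noise-free case, the response is no longer f itself

• The covariance of the observed responses is defined as
  $\mathrm{cov}[y_p, y_q] = \kappa(x_p, x_q) + \sigma_y^2 \delta_{pq}$

• where δpq = I(p = q). In other words,
  $\mathrm{cov}[\mathbf{y} \mid X] = K + \sigma_y^2 I_N \triangleq K_y$

• The second matrix is diagonal because the noise terms added to each observation are independent

• The joint distribution of the noisy observations and the noise-free function values at the test points is
  $\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{pmatrix} K_y & K_* \\ K_*^T & K_{**} \end{pmatrix} \right)$

• Here the mean is assumed to be 0 to simplify the notation

• The posterior predictive density is therefore
  $p(\mathbf{f}_* \mid X_*, X, \mathbf{y}) = \mathcal{N}(\mathbf{f}_* \mid \boldsymbol{\mu}_*, \Sigma_*)$, with $\boldsymbol{\mu}_* = K_*^T K_y^{-1} \mathbf{y}$ and $\Sigma_* = K_{**} - K_*^T K_y^{-1} K_*$

• The only difference from the noise-free model is that $\sigma_y^2$ has been added to the diagonal of K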


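• For a single test input, the distribution is
  $p(f_* \mid x_*, X, \mathbf{y}) = \mathcal{N}(f_* \mid \mathbf{k}_*^T K_y^{-1} \mathbf{y},\; k_{**} - \mathbf{k}_*^T K_y^{-1} \mathbf{k}_*)$

• where $\mathbf{k}_* = [\kappa(x_*, x_1), \ldots, \kappa(x_*, x_N)]$ and $k_{**} = \kappa(x_*, x_*)$.

• Another way to write the posterior mean is as follows:
  $\bar{f}_* = \mathbf{k}_*^T K_y^{-1} \mathbf{y} = \sum_{i=1}^{N} \alpha_i \kappa(x_i, x_*)$

• where $\boldsymbol{\alpha} = K_y^{-1} \mathbf{y}$. We will revisit this expression later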
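The noisy-observation case can be sketched with a Cholesky factorization of the noisy training covariance, which yields the weight vector α and the predictive mean as a kernel-weighted sum. The kernel, the hyperparameter values, and the toy data below are assumptions for illustration.

```python
import numpy as np

def se_kernel(xa, xb, ell=1.0, sigma_f=1.0):
    """Squared-exponential kernel for 1-D inputs (an illustrative choice)."""
    d = xa[:, None] - xb[None, :]
    return sigma_f**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_regression(x_train, y_train, x_test, ell=1.0, sigma_f=1.0, sigma_y=0.1):
    """Posterior mean and covariance of f* given noisy observations y_train."""
    N = len(x_train)
    K_y = se_kernel(x_train, x_train, ell, sigma_f) + sigma_y**2 * np.eye(N)
    K_s = se_kernel(x_train, x_test, ell, sigma_f)
    K_ss = se_kernel(x_test, x_test, ell, sigma_f)
    L = np.linalg.cholesky(K_y)                                  # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))   # alpha = K_y^{-1} y
    mean = K_s.T @ alpha                                         # sum_i alpha_i k(x_i, x*)
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                                         # K** - K*^T K_y^{-1} K*
    return mean, cov, alpha

x_train = np.array([-3.0, -1.0, 0.0, 2.0])
y_train = np.sin(x_train)
x_test = np.array([0.5, 6.0])
mean, cov, alpha = gp_regression(x_train, y_train, x_test)
```

Going through the Cholesky factor instead of an explicit inverse is a common numerical choice; it gives the same result as the closed-form expressions.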

15.2.3 Effect of the kernel parameters

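• The predictive performance of a GP depends entirely on the chosen kernel.

• Suppose we choose the following squared-exponential (SE) kernel for the noisy observations
  $\kappa_y(x_p, x_q) = \sigma_f^2 \exp\left(-\frac{(x_p - x_q)^2}{2\ell^2}\right) + \sigma_y^2 \delta_{pq}$

• $\ell$: the horizontal length scale over which the function varies

• $\sigma_f^2$: controls the vertical scale of the function

• $\sigma_y^2$: the noise variance

• (a) ℓ = 1: a good fit

• (b) ℓ reduced to 0.3: the function becomes more wiggly, and the uncertainty grows faster away from the data

• (c) ℓ = 3: the function looks smoother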
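To make the role of the length scale concrete, the sketch below evaluates the SE kernel between two inputs one unit apart for ℓ = 0.3, 1 and 3. The function name `kappa_y` and the `same_point` flag (standing in for δpq) are illustrative assumptions.

```python
import numpy as np

def kappa_y(xp, xq, ell, sigma_f, sigma_y, same_point=False):
    """SE kernel for noisy observations; same_point plays the role of delta_pq."""
    k = sigma_f**2 * np.exp(-(xp - xq) ** 2 / (2.0 * ell**2))
    return k + (sigma_y**2 if same_point else 0.0)

# Covariance between function values at inputs one unit apart, per length scale.
corr = {ell: kappa_y(0.0, 1.0, ell, sigma_f=1.0, sigma_y=0.1)
        for ell in (0.3, 1.0, 3.0)}
```

With a short length scale the two function values are almost uncorrelated, so the function can change quickly (wiggly fit); with a long one they are almost perfectly correlated (smooth fit).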

15.2.4 Estimating the kernel parameters

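• Estimating the kernel parameters by cross-validation (CV) is too slow

• Instead, we choose the kernel parameters that maximize the following marginal likelihood:
  $\log p(\mathbf{y} \mid X) = \log \mathcal{N}(\mathbf{y} \mid \mathbf{0}, K_y) = -\frac{1}{2} \mathbf{y}^T K_y^{-1} \mathbf{y} - \frac{1}{2} \log |K_y| - \frac{N}{2} \log(2\pi)$

• The first term is a data fit term, the second term is a model complexity term, and the third term is a constant

• For an RBF kernel, the smaller the bandwidth, the closer the similarity to distant points gets to 0, so only nearby points are used for prediction and GP regression fits the data very closely, as in the following figure. When the bandwidth becomes extremely small, Ky becomes a diagonal matrix (every point has similarity 0 to every other point). Then log|Ky| is large (the determinant of a diagonal matrix is the product of its diagonal entries, so log|Ky| is the sum of their logs), while the data-fit error is small

• Conversely, as the kernel parameter (bandwidth) grows, the data-fit error grows and log|Ky| shrinks

• So the first term can be read as a data-fit (likelihood) term and the second term as a model-complexity term, and there is a trade-off between them.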
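The trade-off between the data-fit term and the complexity term can be inspected numerically. The sketch below returns the log marginal likelihood of GP regression together with its first two terms, assuming an SE kernel with fixed σf and σy; the data and the length scales tried are hypothetical.

```python
import numpy as np

def log_marginal_likelihood(x, y, ell, sigma_f=1.0, sigma_y=0.1):
    """Return (log p(y|X), data-fit term, complexity term) for an SE kernel."""
    N = len(x)
    d = x[:, None] - x[None, :]
    K_y = sigma_f**2 * np.exp(-0.5 * (d / ell) ** 2) + sigma_y**2 * np.eye(N)
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    data_fit = -0.5 * y @ alpha                   # -1/2 y^T K_y^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))      # -1/2 log|K_y|
    const = -0.5 * N * np.log(2.0 * np.pi)
    return data_fit + complexity + const, data_fit, complexity

x = np.linspace(0.0, 5.0, 20)
y = np.sin(x)
total, fit, comp = log_marginal_likelihood(x, y, ell=1.0)
_, _, comp_short = log_marginal_likelihood(x, y, ell=0.05)   # K_y nearly diagonal
_, _, comp_long = log_marginal_likelihood(x, y, ell=5.0)     # K_y nearly all ones
```

For the very short length scale log|Ky| is larger, so the complexity term penalizes the model more than it does for the long length scale.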

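• Find the kernel parameter ℓ that minimizes the following objective function, the negative log marginal likelihood $-\log p(\mathbf{y} \mid X)$.

• Compute the gradient of the log marginal likelihood with respect to each kernel parameter $\theta_j$:
  $\frac{\partial}{\partial \theta_j} \log p(\mathbf{y} \mid X) = \frac{1}{2} \mathrm{tr}\left( (\boldsymbol{\alpha} \boldsymbol{\alpha}^T - K_y^{-1}) \frac{\partial K_y}{\partial \theta_j} \right)$, where $\boldsymbol{\alpha} = K_y^{-1} \mathbf{y}$

• Optimize with a standard gradient-based optimizer such as gradient descent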
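A sketch of gradient-based estimation of the length scale follows. It computes the log marginal likelihood and its derivative with respect to ℓ for an SE kernel, then takes simple gradient steps on log ℓ. The backtracking rule, the step clipping, the step count, and the data are assumptions made to keep the example short and stable; a real implementation would more likely hand the objective to an off-the-shelf optimizer.

```python
import numpy as np

def lml_and_grad(x, y, ell, sigma_f=1.0, sigma_y=0.1):
    """Log marginal likelihood and its derivative with respect to ell."""
    N = len(x)
    d2 = (x[:, None] - x[None, :]) ** 2
    K = sigma_f**2 * np.exp(-0.5 * d2 / ell**2)
    K_y = K + sigma_y**2 * np.eye(N)
    K_inv = np.linalg.inv(K_y)
    alpha = K_inv @ y
    _, logdet = np.linalg.slogdet(K_y)
    lml = -0.5 * y @ alpha - 0.5 * logdet - 0.5 * N * np.log(2.0 * np.pi)
    dK = K * d2 / ell**3                                   # dK_y / d ell
    grad = 0.5 * np.trace((np.outer(alpha, alpha) - K_inv) @ dK)
    return lml, grad

def fit_length_scale(x, y, ell0=1.0, steps=50, lr=0.1):
    """Gradient ascent on log(ell); a step is kept only if it improves the objective."""
    log_ell = np.log(ell0)
    best, grad = lml_and_grad(x, y, np.exp(log_ell))
    for _ in range(steps):
        # chain rule: d lml / d log(ell) = grad * ell; clip to avoid huge jumps
        step = np.clip(lr * grad * np.exp(log_ell), -1.0, 1.0)
        val, g = lml_and_grad(x, y, np.exp(log_ell + step))
        if val > best:
            log_ell, best, grad = log_ell + step, val, g
        else:
            lr *= 0.5                                      # backtrack
    return np.exp(log_ell), best

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 15)
y = np.sin(x) + 0.1 * rng.standard_normal(15)
ell_hat, lml_hat = fit_length_scale(x, y, ell0=0.3)
```

Optimizing log ℓ rather than ℓ keeps the length scale positive without an explicit constraint.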

15.3 GPs meet GLMs

• GPs can be applied to other GLMs

• Previously the function was constrained to be linear in x, as in f(x) = w^T x; now f ∼ GP(·)

• As an example, apply this to the binary classification model (15.3.1)

• Define the model as $p(y_i \mid x_i) = \sigma(y_i f(x_i))$

• where yi ∈ {−1, +1}, and we let σ(z) = sigm(z) (logistic regression)

• If yi = 1, fi must be greater than 0 for the probability to exceed 0.5

• If yi = −1, fi must be less than 0 for the probability to exceed 0.5

• As for GP regression, we assume f ∼ GP(0, κ).

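Prerequisite: Gaussian approximation (aka Laplace approximation)

• http://play.daumcorp.com/pages/viewpage.action?pageId=157627536#8Logisticregression-8.4.1Laplaceapproximation

• A method for approximating the posterior with a normal distribution

• The posterior is approximated by a normal distribution whose mean is the MAP estimate and whose covariance is the inverse of the Hessian (second derivative) of the negative log posterior at that point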
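A one-dimensional sketch of the Laplace approximation: find the mode of the posterior by Newton's method on the negative log posterior, then use the inverse second derivative at the mode as the variance. The finite-difference derivatives, the function names, and the Gamma-shaped example density are illustrative assumptions.

```python
import math

def laplace_approx_1d(neg_log_post, theta0, iters=50, h=1e-4):
    """Return (mode, variance) of the Gaussian approximation in one dimension."""
    def d1(t):   # central-difference first derivative
        return (neg_log_post(t + h) - neg_log_post(t - h)) / (2.0 * h)

    def d2(t):   # central-difference second derivative
        return (neg_log_post(t + h) - 2.0 * neg_log_post(t) + neg_log_post(t - h)) / h**2

    theta = theta0
    for _ in range(iters):
        theta -= d1(theta) / d2(theta)       # Newton step toward the mode
    return theta, 1.0 / d2(theta)

# Unnormalized posterior proportional to theta^4 * exp(-2 theta), for theta > 0.
def neg_log_post(theta):
    return -4.0 * math.log(theta) + 2.0 * theta

mode, var = laplace_approx_1d(neg_log_post, theta0=1.5)
```

For this density the mode is θ = 2 and the second derivative of the negative log posterior there is 4/θ² = 1, so the approximating Gaussian is N(2, 1).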


Prerequisite: Gaussian approximation for logistic regression

• As a result, $\hat{w}$ is the MAP estimate of logistic regression derived earlier, and H is the Hessian (second derivative) of the negative log posterior evaluated at $\hat{w}$

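15.3.1.1 Computing the posterior

• Define the log of the unnormalized posterior as follows:
  $\ell(\mathbf{f}) = \log p(\mathbf{y} \mid \mathbf{f}) + \log p(\mathbf{f} \mid X) = \log p(\mathbf{y} \mid \mathbf{f}) - \frac{1}{2} \mathbf{f}^T K^{-1} \mathbf{f} - \frac{1}{2} \log |K| - \frac{N}{2} \log 2\pi$

• Let $J(\mathbf{f}) \triangleq -\ell(\mathbf{f})$ be the function we want to minimize

• The gradient and Hessian of this are given by
  $\mathbf{g} = -\nabla \log p(\mathbf{y} \mid \mathbf{f}) + K^{-1} \mathbf{f}$ and $H = -\nabla\nabla \log p(\mathbf{y} \mid \mathbf{f}) + K^{-1} = W + K^{-1}$, where $W \triangleq -\nabla\nabla \log p(\mathbf{y} \mid \mathbf{f})$
  (the form of the likelihood terms depends on the choice of the function σ)

• Optimizing with Newton's method gives the update
  $\mathbf{f}^{new} = \mathbf{f} - H^{-1} \mathbf{g} = (K^{-1} + W)^{-1} (W \mathbf{f} + \nabla \log p(\mathbf{y} \mid \mathbf{f}))$

• Call the f that maximizes the posterior in this way $\hat{\mathbf{f}}$

• At convergence, the Gaussian approximation of the posterior of f is
  $p(\mathbf{f} \mid X, \mathbf{y}) \approx \mathcal{N}(\hat{\mathbf{f}}, (K^{-1} + W)^{-1})$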

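15.3.1.2 Computing the posterior predictive

• The posterior predictive at a test point x∗ is computed in two steps. First, the latent value f∗ is approximated by a Gaussian with
  $\mathbb{E}[f_* \mid x_*, X, \mathbf{y}] = \mathbf{k}_*^T K^{-1} \mathbb{E}[\mathbf{f} \mid X, \mathbf{y}] \approx \mathbf{k}_*^T K^{-1} \hat{\mathbf{f}}$ and $\mathrm{var}[f_*] = k_{**} - \mathbf{k}_*^T (K + W^{-1})^{-1} \mathbf{k}_*$
  (the expectation of f is approximated by its MAP estimate $\hat{\mathbf{f}}$)

• Second, the class probability is obtained by integrating the likelihood over f∗: $\pi_* = p(y_* = 1 \mid x_*, X, \mathbf{y}) \approx \int \sigma(f_*) \, p(f_* \mid x_*, X, \mathbf{y}) \, df_*$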
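The posterior mode search of 15.3.1.1 and the predictive step of 15.3.1.2 can be sketched together as below for the logistic likelihood. The SE kernel, the toy data, the fixed iteration count, and the probit-style approximation used for the final integral are assumptions; the Newton update is rewritten as K (I + W K)^{-1} (W f + ∇ log p(y|f)) so that K is never inverted.

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_kernel(xa, xb, ell=1.0):
    """Squared-exponential kernel for 1-D inputs (an illustrative choice)."""
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gpc_fit(K, y, iters=20):
    """Newton iterations for the posterior mode f_hat; labels y are in {-1, +1}."""
    N = len(y)
    t = (y + 1) / 2.0                        # labels mapped to {0, 1}
    f = np.zeros(N)
    for _ in range(iters):
        pi = sigm(f)
        grad = t - pi                        # gradient of log p(y|f)
        W = np.diag(pi * (1.0 - pi))         # W = -Hessian of log p(y|f)
        # equivalent to (K^{-1} + W)^{-1} (W f + grad)
        f = K @ np.linalg.solve(np.eye(N) + W @ K, W @ f + grad)
    pi = sigm(f)
    return f, np.diag(pi * (1.0 - pi)), t - pi

def gpc_predict(x_train, x_test, W, grad_hat, ell=1.0):
    """Latent mean/variance and class-1 probability at the test inputs."""
    K = se_kernel(x_train, x_train, ell)
    k_s = se_kernel(x_train, x_test, ell)    # N x N*
    mean = k_s.T @ grad_hat                  # k*^T K^{-1} f_hat, as f_hat = K grad_hat at the mode
    A = np.linalg.solve(np.eye(len(x_train)) + W @ K, W @ k_s)   # (K + W^{-1})^{-1} k*
    var = 1.0 - np.sum(k_s * A, axis=0)      # k** = 1 for this kernel
    prob = sigm(mean / np.sqrt(1.0 + np.pi * var / 8.0))         # approximate integral
    return mean, var, prob

x_train = np.array([-2.0, -1.0, 1.0, 2.0])
y_train = np.array([-1.0, -1.0, 1.0, 1.0])
K = se_kernel(x_train, x_train)
f_hat, W, grad_hat = gpc_fit(K, y_train)
mean, var, prob = gpc_predict(x_train, np.array([-1.5, 1.5]), W, grad_hat)
```

At the mode the gradient of J vanishes, so f_hat = K ∇ log p(y|f_hat); the predictive mean uses this identity to avoid forming K^{-1}.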

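15.3.1.3 Computing the marginal likelihood

• The marginal likelihood is needed in order to optimize the kernel parameters. Its approximation is obtained by substituting the optimum $\hat{\mathbf{f}}$ into the result of 15.3.1.1 Computing the posterior

• log p(D) is optimized using its derivative with respect to the kernel parameters, but this is difficult because W, K and $\hat{\mathbf{f}}$ all depend on the kernel parameters (Rasmussen and Williams 2006, p125)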
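For reference, the Laplace approximation of the marginal likelihood can be written out as follows. This is a sketch that follows the standard derivation (a second-order expansion of ℓ(f) around the mode, with ℓ, W and K as defined in 15.3.1.1), not a transcription of the slide.

```latex
\begin{align*}
p(\mathbf{y} \mid X) &= \int \exp\big(\ell(\mathbf{f})\big)\, d\mathbf{f}
  \;\approx\; \exp\big(\ell(\hat{\mathbf{f}})\big)\, (2\pi)^{N/2}\, \big|K^{-1} + W\big|^{-1/2} \\
\log p(\mathbf{y} \mid X) &\approx \log p(\mathbf{y} \mid \hat{\mathbf{f}})
  - \tfrac{1}{2}\, \hat{\mathbf{f}}^{T} K^{-1} \hat{\mathbf{f}}
  - \tfrac{1}{2} \log |K|
  - \tfrac{1}{2} \log \big|K^{-1} + W\big| \\
 &= \log p(\mathbf{y} \mid \hat{\mathbf{f}})
  - \tfrac{1}{2}\, \hat{\mathbf{f}}^{T} K^{-1} \hat{\mathbf{f}}
  - \tfrac{1}{2} \log \big|I_N + K W\big|
\end{align*}
```

The 2π factors from the prior and from the Gaussian integral cancel, and the last line uses |K| |K^{-1} + W| = |I_N + K W|.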

5 Summary

• Why are Gaussian processes attractive?

• They are a Bayesian method
  • The uncertainty of a prediction can be quantified
  • Bayesian tools for model selection and hyperparameter selection can be used as they are
  • For example, the approach above of optimizing the marginal likelihood with respect to the RBF bandwidth

• They are non-parametric
  • They model an arbitrary function of the input points (no model assumption)
