Sparse Linear Model
Jungkyu Lee, Daum Search Quality Team

Murphy's Machine Learning, Chapter 13: Sparse Linear Model



Page 1: 머피의 머신러닝 13 Sparse Linear Model

Sparse Linear Model

Jungkyu Lee, Daum Search Quality Team

Page 2: 머피의 머신러닝 13 Sparse Linear Model

13.1 Introduction

• We look at how to do feature selection using a model-based approach.
• Applications:

• In the small-N, large-D setting there are too many features, so we want to perform feature selection.
• Chapter 14 covers kernel functions (sparse kernel machines).

• There, the analogue of feature selection is using only a subset of the N training examples.

Page 3: 머피의 머신러닝 13 Sparse Linear Model

5.3 Bayesian model selection

• In regression, a polynomial of too high a degree can overfit, while one of too low a degree can underfit.

• Given models of different complexity, which one is generally the best?

• In Chapter 13, the model is a feature subset.
• Approaches:

• One approach is to use cross-validation to estimate the generalization error of all the candidate models, and then to pick the model that seems the best.

• A more efficient approach is to compute the posterior over models (Bayesian model selection).

• If we use a uniform prior over models, p(m) ∝ 1, this amounts to picking the model which maximizes the marginal likelihood p(D|m) = ∫ p(D|θ) p(θ|m) dθ.

Cross-validation requires splitting the data into train and test sets (the usual practice in the CS community), whereas the posterior-based approach seems to work with the training set only (BIC, AIC). Honestly I do not yet fully understand why one would do it that way.

Page 4: 머피의 머신러닝 13 Sparse Linear Model

5.3.2.4 BIC approximation to log marginal likelihood

• In general, computing the integral in Equation 5.13 can be quite difficult.

• A simple and popular approximation is the Bayesian information criterion (BIC):
log p(D|m) ≈ log p(D|θ̂) − (dof(θ̂)/2) log N

• dof(θ̂) is the number of degrees of freedom in the model.

• This is a penalized log likelihood: the first term measures fit (the likelihood) and the second term penalizes model complexity.
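As a concrete illustration (not from the slides), here is a minimal Python sketch of such a BIC score for a linear-regression feature subset, assuming Gaussian noise with the MLE variance; the function name bic_linear_regression and its interface are assumptions made for this sketch.

```python
import numpy as np

def bic_linear_regression(X, y, support):
    """BIC-style score for a linear regression restricted to the feature
    subset `support` (a boolean mask). Approximates log p(D | m); higher is
    better. Assumes Gaussian noise with the MLE variance."""
    N = len(y)
    Xs = X[:, support]
    w, *_ = np.linalg.lstsq(Xs, y, rcond=None)       # MLE of the weights
    resid = y - Xs @ w
    sigma2 = max(resid @ resid / N, 1e-12)           # MLE noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)
    dof = int(support.sum()) + 1                     # weights + noise variance
    return loglik - 0.5 * dof * np.log(N)
```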

Page 5: 머피의 머신러닝 13 Sparse Linear Model

Bayesian variable selection

The spike and slab model

Bernoulli-Gaussian model

l0 regularization

l1 regularization (lasso)

how to compute p(D|γ)

hard to optimize

wj are kept (irrelevant coefficients are not removed)

Page 6: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection

• Whether each feature is relevant is treated as a random variable.

• model = m = γ

• Let γj =1 if feature j is “relevant”, and let γj =0 otherwise.

• Our goal is to compute the posterior over models

Page 7: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection

linregAllsubsetsGraycodeDemo
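The book's linregAllsubsetsGraycodeDemo enumerates every feature subset of a small problem and scores it. A rough Python analogue is sketched below; the actual demo is MATLAB and uses Gray codes, so this brute-force version and its names are illustrative only. The scoring function is assumed to return an approximate log marginal likelihood, e.g., the bic_linear_regression sketch above.

```python
import itertools
import numpy as np

def posterior_over_subsets(X, y, log_marginal):
    """Enumerate all 2^D - 1 non-empty feature subsets (feasible only for
    small D), score each with an approximate log marginal likelihood, and
    normalize to get p(gamma | D) under a uniform prior over models."""
    D = X.shape[1]
    masks, scores = [], []
    for bits in itertools.product([False, True], repeat=D):
        mask = np.array(bits)
        if mask.any():                       # skip the empty model
            masks.append(mask)
            scores.append(log_marginal(X, y, mask))
    scores = np.array(scores)
    post = np.exp(scores - scores.max())     # subtract max for stability
    return masks, post / post.sum()

# e.g. masks, post = posterior_over_subsets(X, y, bic_linear_regression)
```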

Page 8: 머피의 머신러닝 13 Sparse Linear Model

Bayesian variable selection

The spike and slab model

Bernoulli-Gaussian model

l0 regularization

l1 regularization (lasso)

how to compute p(D|γ)

hard to optimize

wj are kept (irrelevant coefficients are not removed)

Page 9: 머피의 머신러닝 13 Sparse Linear Model

13.2.1 The spike and slab model

• We discuss how to compute p(D|γ) concretely (for the linear regression case).

• The posterior is given by p(γ|D) ∝ p(D|γ) p(γ), with a Bernoulli prior p(γ) = ∏j Ber(γj|π0) = π0^||γ||0 (1−π0)^(D−||γ||0),

where ||γ||0 = Σj γj is the number of non-zero elements of the vector.

Page 10: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.1 The spike and slab model

• Drop from X and w the features for which γj = 0, leaving the reduced Xr and wr.

(The variance of p(D|γ) changes depending on the feature selection γ.)

Page 11: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.1 The spike and slab model

• When the marginal likelihood cannot be computed in closed form (e.g., if we are using logistic regression or a nonlinear model), we can approximate it using BIC:
log p(D|γ) ≈ log p(y|X, ŵγ, σ̂²) − (||γ||0/2) log N

(The ||γ||0 term penalizes model complexity.)

Page 12: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.1 The spike and slab model

• To summarize, to compute p(γ|D):

• the resulting log posterior over the feature relevance vector γ is approximately
log p(γ|D) ≈ log p(y|X, ŵγ) − (||γ||0/2) log N − λ||γ||0 + const

• That is, (marginal likelihood) − (model complexity) ≈ (likelihood − model complexity) − (model complexity).

• The complexity penalty therefore appears twice, so we simply collect it into a single λ term.

Page 13: 머피의 머신러닝 13 Sparse Linear Model

Bayesian variable selection

The spike and slab model

Bernoulli-Gaussian model

l0 regularization

l1 regularization (lasso)

how to compute p(D|γ)

hard to optimize

wj are kept (irrelevant coefficients are not removed)

Page 14: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.2 From the Bernoulli-Gaussian model to l0 regularization

• The Bernoulli-Gaussian model is also known as the binary mask model.

• Unlike the spike and slab model, the irrelevant coefficients do not disappear (they are not integrated out).
• The binary mask model has the form γj → y ← wj, whereas the spike and slab model has the form γj → wj → y.

Page 15: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.2 From the Bernoulli-Gaussian model to l0 regularization

• The Bernoulli-Gaussian model can be used to derive l0 regularization.

• Given the data, the posterior over γ and w is proportional to the likelihood times the joint prior.

• The joint prior p(γ, w) is defined as a product of the Bernoulli prior on γ and the Gaussian prior on w.

(That is, the γ and w that minimize the resulting objective are the γ and w with the highest posterior.)

Page 16: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.2 From the Bernoulli-Gaussian model to l0 regularization

• If σw² → ∞, the objective simplifies:

• it takes a form similar to the BIC approximation, a likelihood plus a model-complexity penalty.
• By dropping the bit vector γ and expressing sparsity through the non-zero wj directly, we can write it as
f(w) = ||y − Xw||² + λ||w||0

• This is called l0 regularization.

• However, l0 regularization is hard to optimize.

• The rest of the chapter looks at ways around this optimization problem (leading to the lasso).

Page 17: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.3 Algorithms

• As noted above, γ can sometimes also be found by optimization (the lasso route).

• However, in some cases such optimization over γ is not possible.

• Since there are 2^D models, we cannot explore the full posterior, or find the globally optimal model.

• Instead we will have to resort to heuristics of one form or another.

• All of the methods we will discuss involve searching through the space of models, and evaluating the cost f(γ) at each point.

Page 18: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.3 Algorithms
13.2.3.1 Greedy search

• Single best replacement:

• The simplest approach is to use greedy hill climbing.

• At each step, the neighborhood of the current model is defined as all models reachable by adding or removing a single variable.

• That is, for each variable: if adding it improves on the current model, add it; if removing it improves on the current model, remove it (see the sketch below).
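A minimal sketch of this single-best-replacement style hill climbing, assuming the cost f(γ) (e.g., a negative log posterior or negative BIC score) is supplied as a function to be minimized; the function name and interface are illustrative, not from the book.

```python
import numpy as np

def single_best_replacement(f, D, max_sweeps=100):
    """Greedy hill climbing over inclusion masks gamma of length D: sweep
    over the variables, flip any single bit (add or delete a feature) that
    lowers the cost f(gamma), and stop when a full sweep yields no
    improvement."""
    gamma = np.zeros(D, dtype=bool)
    best = f(gamma)
    for _ in range(max_sweeps):
        improved = False
        for j in range(D):
            candidate = gamma.copy()
            candidate[j] = not candidate[j]      # add or delete feature j
            score = f(candidate)
            if score < best:
                gamma, best, improved = candidate, score, True
        if not improved:
            break
    return gamma, best
```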

Page 19: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.3 Algorithms
13.2.3.1 Greedy search

• Orthogonal least squares

• If λ = 0, the model-complexity penalty in Equation (13.27) disappears and there is no longer any reason for deletion steps, since there is no benefit to dropping a variable (the training error only keeps decreasing).

• In this case, SBR reduces to orthogonal least squares, i.e., greedy forward selection.

• Starting from the current feature set, we try adding each candidate feature, re-optimize w, and pick the feature that yields the smallest error.

• We then update the active set by setting γ(t+1) = γ(t) ∪ {j*}.

• To choose the next feature to add at step t, we need to solve D − Dt least squares problems, where Dt = |γt| is the cardinality of the current active set.

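A minimal sketch of greedy forward selection in this least squares setting; the function name and the fixed budget K of features are assumptions for illustration.

```python
import numpy as np

def greedy_forward_selection(X, y, K):
    """Orthogonal least squares / greedy forward selection: at every step,
    tentatively add each remaining feature, refit the least squares weights
    on the enlarged active set, and keep the feature that lowers the
    residual sum of squares the most. Stops after K features."""
    D = X.shape[1]
    active = []
    for _ in range(K):
        best_j, best_rss = None, np.inf
        for j in range(D):
            if j in active:
                continue
            cols = active + [j]
            w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = float(np.sum((y - X[:, cols] @ w) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        active.append(best_j)                # gamma(t+1) = gamma(t) U {j*}
    return active
```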

Page 20: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.3 Algorithms
13.2.3.1 Greedy search

• Orthogonal matching pursuit (OMP)

• Here we are just looking for the column that is most correlated with the current residual.

• This only requires one least squares calculation per iteration and so is faster than orthogonal least squares, but is not quite as accurate. (Instead of trying every feature, we only test the feature most correlated with the residual.)

• An even more aggressive approximation is to just greedily add the feature that is most correlated with the current residual.

• This is called matching pursuit (Mallat and Zhang 1993).

• This is also equivalent to a method known as least squares boosting (Section 16.4.6).
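A minimal OMP sketch under the usual assumption that the columns of X are unit-normalized; the function name and the fixed budget K are illustrative.

```python
import numpy as np

def orthogonal_matching_pursuit(X, y, K):
    """Orthogonal matching pursuit: at each step pick the column most
    correlated with the current residual, add it to the active set, and
    refit least squares on the active set to re-orthogonalize the residual.
    Assumes the columns of X are normalized to unit l2 norm."""
    D = X.shape[1]
    active = []
    residual = y.astype(float).copy()
    w = np.zeros(D)
    for _ in range(K):
        corr = X.T @ residual
        corr[active] = 0.0                       # never re-select a feature
        j = int(np.argmax(np.abs(corr)))
        active.append(j)
        w_active, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        residual = y - X[:, active] @ w_active
    w[active] = w_active
    return w, active
```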

Page 21: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.3 Algorithms
13.2.3.1 Greedy search

• Backwards selection
• starts with all variables in the model (the so-called saturated model), and then deletes the worst one at each step.
• This is equivalent to performing a greedy search from the top of the lattice downwards.
• This can give better results than a bottom-up search, since the decision about whether to keep a variable or not is made in the context of all the other variables that might depend on it. (Selection happens with the potentially dependent features already present, so performance is better.)
• However, this method is typically infeasible for large problems, since the saturated model will be too expensive to fit. (There are many features to fit, so the computation is heavy.)

• Bayesian matching pursuit
• The algorithm of (Schniter et al. 2008) is similar to OMP except it uses a Bayesian marginal likelihood scoring criterion (under a spike and slab model) instead of a least squares objective.

Page 22: 머피의 머신러닝 13 Sparse Linear Model

13.2 Bayesian variable selection
13.2.3 Algorithms
13.2.3.2 Stochastic search

• If we want to approximate the posterior, rather than just computing a mode (e.g. because we want to compute marginal inclusion probabilities), one option is to use MCMC.

• The standard approach is to use Metropolis Hastings, where the proposal distribution just flips single bits

• This enables us to efficiently compute p(γ′|D) given p(γ|D).

• The probability of a state (bit configuration) is estimated by counting how many times the random walk visits this state.
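A minimal Metropolis-Hastings sketch over bit vectors, assuming an unnormalized log posterior log p(γ|D) is available as a function; the names, the seed, and the inclusion-probability estimate from visit counts are illustrative choices, not code from the book.

```python
import numpy as np

def mh_variable_selection(log_post, D, n_samples=10000, seed=0):
    """Metropolis-Hastings over bit vectors gamma: the proposal flips one
    randomly chosen bit (a symmetric proposal, so the acceptance ratio is
    just the posterior ratio). `log_post` returns the unnormalized
    log p(gamma | D). Marginal inclusion probabilities are estimated from
    how often the chain visits states with each bit switched on."""
    rng = np.random.default_rng(seed)
    gamma = np.zeros(D, dtype=bool)
    lp = log_post(gamma)
    counts = np.zeros(D)
    for _ in range(n_samples):
        proposal = gamma.copy()
        j = rng.integers(D)
        proposal[j] = not proposal[j]            # flip a single bit
        lp_new = log_post(proposal)
        if np.log(rng.random()) < lp_new - lp:   # accept or reject
            gamma, lp = proposal, lp_new
        counts += gamma
    return counts / n_samples                    # marginal inclusion probabilities
```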

Instead of exploring all 2^D configurations of γ, or searching heuristically like this, is there no way to optimize analytically? That is where the lasso comes in.

Page 23: 머피의 머신러닝 13 Sparse Linear Model

Bayesian variable selection

The spike and slab model

Bernoulli-Gaussian model

l0 regularization

l1 regularization (lasso)

how to compute p(D|γ)

hard to optimize

wj are kept (irrelevant coefficients are not removed)

Page 24: 머피의 머신러닝 13 Sparse Linear Model

13.3 l1 regularization: basics
Why switch from l0 to l1?

• When we have many variables, it is computationally difficult to find the posterior mode of p(γ|D).

• Part of the problem is due to the fact that the γj variables are discrete, γj ∈ {0, 1}.

• In the optimization community, it is common to relax hard constraints of this form by replacing discrete variables with continuous variables.

• We can do this by replacing the spike-and-slab style prior, which assigns finite probability mass to the event that wj = 0, with continuous priors that “encourage” wj = 0 by putting a lot of probability density near the origin, such as a zero-mean Laplace distribution.

• l1 regularization

• In the case of linear regression, the l1 objective becomes the familiar lasso problem f(w) = RSS(w) + λ||w||1.
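To make the effect concrete, here is a small scikit-learn comparison on synthetic data; this is an illustrative aside rather than code from the chapter, and the data and parameter values are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
N, D = 50, 20
X = rng.standard_normal((N, D))
w_true = np.zeros(D)
w_true[:3] = [2.0, -3.0, 1.5]                  # only 3 relevant features
y = X @ w_true + 0.1 * rng.standard_normal(N)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print("lasso non-zero coefficients:", int(np.sum(lasso.coef_ != 0)))
print("ridge non-zero coefficients:", int(np.sum(ridge.coef_ != 0)))
# The l1 penalty drives most coefficients exactly to zero,
# while the l2 (ridge) penalty only shrinks them.
```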

Page 25: 머피의 머신러닝 13 Sparse Linear Model

13.3.1 Why does l1 regularization yield sparse solutions?

• lasso stands for “least absolute shrinkage and selection operator”.

• The optimum is more likely to land on a corner of the l1 constraint set.

(Figure annotations: constraint region vs. objective contours. The penalty at a corner is smaller, so a w sitting on a corner is preferred by the optimization, and a w on a corner is a sparse w.)

Page 26: 머피의 머신러닝 13 Sparse Linear Model

13.3 l1 regularization: basics
13.3.2 Optimality conditions for lasso

• The lasso objective has the form f(w) = RSS(w) + λ||w||1.

• Unfortunately, the ||w||1 term is not differentiable whenever wj = 0.

• This is an example of a non-smooth optimization problem.

Page 27: 머피의 머신러닝 13 Sparse Linear Model

13.3 l1 regularization: basics
13.3.2 Optimality conditions for lasso

• To handle non-smooth functions, we need to extend the notion of a derivative.

• We define a subderivative or subgradient of a (convex) function f: I → R at a point θ0 to be a scalar g such that f(θ) ≥ f(θ0) + g(θ − θ0) for all θ ∈ I.

• We define the set of subderivatives as the interval [a, b], where a and b are the one-sided limits
a = lim(θ→θ0−) [f(θ) − f(θ0)] / (θ − θ0), b = lim(θ→θ0+) [f(θ) − f(θ0)] / (θ − θ0).

Page 28: 머피의 머신러닝 13 Sparse Linear Model

13.3 l1 regularization: basics
13.3.2 Optimality conditions for lasso

• The set [a, b] of all subderivatives is called the subdifferential of the function f at θ0 and is denoted ∂f(θ)|θ0.

• For example, in the case of the absolute value function f(θ) = |θ|, the subderivative is given by ∂f(θ) = {−1} if θ < 0, [−1, 1] if θ = 0, and {+1} if θ > 0.

• If the function is everywhere differentiable, then ∂f(θ)={df(θ)/dθ}.

(At 0 there are infinitely many valid slopes, i.e., infinitely many subderivatives.)

Page 29: 머피의 머신러닝 13 Sparse Linear Model

13.3 l1 regularization: basics
13.3.2 Optimality conditions for lasso

• Let us apply these concepts to the lasso problem.

• Let us initially ignore the non-smooth penalty term. The smooth part has partial derivative
∂/∂wj RSS(w) = aj wj − cj, with aj = 2 Σi xij² and cj = 2 Σi xij (yi − w−jᵀ xi,−j),

• where w−j is w without component j, and similarly for xi,−j.

• We see that cj is (proportional to) the correlation between the j'th feature x:,j and the residual due to the other features, r−j = y − X:,−j w−j.

• (That is, the correlation between the column of X for feature j and the residual vector between the actual values and the predictions made without feature j.)

• Hence the magnitude of cj is an indication of how relevant feature j is for predicting y(relative to the other features and the current parameters).

• Roughly, it measures how much of the remaining gap to y feature j could close if it were included in the prediction.

(Annotation: the correlation between feature j and the residual obtained by predicting with every feature except j.)

Page 30: 머피의 머신러닝 13 Sparse Linear Model
Page 31: 머피의 머신러닝 13 Sparse Linear Model

• Therefore the w that optimizes f can be written as a function of the range that cj falls in:
ŵj(cj) = (cj + λ)/aj if cj < −λ, 0 if −λ ≤ cj ≤ λ, and (cj − λ)/aj if cj > λ,

• which can be written compactly as ŵj = soft(cj/aj ; λ/aj),

• where soft(a; δ) := sign(a)(|a| − δ)+ and x+ = max(x, 0) is the positive part of x. This is called soft thresholding.

(Annotation: if the correlation cj between feature j and the residual from the other features satisfies −λ ≤ cj ≤ λ, then wj is set to 0, i.e., feature j is not used.)
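A one-line sketch of the soft thresholding operator described above (the helper name is an assumption for illustration):

```python
import numpy as np

def soft_threshold(a, delta):
    """Soft thresholding: sign(a) * max(|a| - delta, 0). Shrinks a toward
    zero and returns exactly zero whenever |a| <= delta."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)
```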

Page 32: 머피의 머신러닝 13 Sparse Linear Model

13.4.1 Coordinate descent

• Hold all features other than the j-th fixed and optimize over wj alone.

Page 33: 머피의 머신러닝 13 Sparse Linear Model

Coordinate descent for lasso (aka shooting algorithm)

• Coordinate descent is useful when each one-dimensional optimization problem can be solved analytically.

• As we saw above, with the remaining coefficients held fixed, the lasso-optimal wj for a given feature can be computed in closed form by soft thresholding (see the sketch below).

• See (Yuan et al. 2010) for some extensions of this method to the logistic regression case.

• The resulting algorithm was the fastest method in their experimental comparison, which concerned document classification with large sparse feature vectors (representing bags of words).
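A minimal sketch of coordinate descent for the lasso along these lines; the function name, warm start, and fixed number of sweeps are assumptions for illustration rather than the reference implementation.

```python
import numpy as np

def lasso_shooting(X, y, lam, n_sweeps=100):
    """Coordinate descent for the lasso objective RSS(w) + lam * ||w||_1
    (the shooting algorithm): cycle through the coordinates and update each
    wj by soft thresholding while all other coefficients stay fixed."""
    N, D = X.shape
    w = np.linalg.lstsq(X, y, rcond=None)[0]   # warm start at least squares
    a = 2.0 * np.sum(X ** 2, axis=0)           # a_j = 2 * sum_i x_ij^2
    for _ in range(n_sweeps):
        for j in range(D):
            w[j] = 0.0
            r = y - X @ w                      # residual without feature j
            c = 2.0 * (X[:, j] @ r)            # correlation with residual
            z = c / a[j]
            w[j] = np.sign(z) * max(abs(z) - lam / a[j], 0.0)   # soft threshold
    return w
```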

Page 34: 머피의 머신러닝 13 Sparse Linear Model

• By contrast, in Figure 13.5(b), we illustrate hard thresholding.

• This sets values of wj to 0 if −λ ≤ cj ≤ λ, but it does not shrink the values of wj outside of this interval.

• The slope of the soft thresholding line does not coincide with the diagonal, which means that even large coefficients are shrunk towards zero;

• consequently lasso is a biased estimator.

• This is undesirable, since if the likelihood indicates (via cj) that the coefficient wj should be large, we do not want to shrink it. We will discuss this issue in more detail in Section 13.6.2.

Page 35: 머피의 머신러닝 13 Sparse Linear Model

13.3.3 Comparison of least squares, lasso, ridge and subset selection

• For simplicity, assume all the features of X are orthonormal, so XᵀX = I. In this case the RSS simplifies to RSS(w) = ||y − Xw||² = yᵀy − 2wᵀXᵀy + wᵀw, so each wj can be estimated separately.

Page 36: 머피의 머신러닝 13 Sparse Linear Model

13.3.3 Comparison of least squares, lasso, ridge and subset selection

• LS = least squares

• Subset = best subset regression(all possible subsets regression procedure)

• lasso gives better prediction accuracy

• Lasso also gives rise to a sparse solution. Of course, for other problems, ridge may give better predictive accuracy.

• In practice, a combination of lasso and ridge, known as the elastic net, often performs best, since it provides a good combination of sparsity and regularization (see Section 13.5.3).

Page 37: 머피의 머신러닝 13 Sparse Linear Model

13.3.4 Regularization path

• As we increase λ, the solution vector ŵ(λ) will tend to get sparser, although not necessarily monotonically.

• We can plot the values ŵj(λ) for each feature j; this is known as the regularization path.
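A small illustrative way to trace such a path with scikit-learn, refitting the lasso over a grid of strengths (the synthetic data and the grid are assumptions; the library also provides a dedicated path routine, but a plain loop keeps the idea explicit):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -3.0, 1.5]
y = X @ w_true + 0.1 * rng.standard_normal(50)

# Refit over a grid of regularization strengths and record the coefficients;
# each curve in the plot is one feature's w_j(lambda).
lambdas = np.logspace(-3, 1, 50)
path = np.array([Lasso(alpha=lam, max_iter=10000).fit(X, y).coef_ for lam in lambdas])

plt.plot(lambdas, path)
plt.xscale("log")
plt.xlabel("lambda")
plt.ylabel("coefficient value")
plt.title("lasso regularization path")
plt.show()
```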

Page 38: 머피의 머신러닝 13 Sparse Linear Model

• There is a critical value of λ at which each coefficient wj first becomes non-zero.
• LARS (least angle regression and shrinkage) is an algorithm that computes these critical points for every feature in a single fitting pass.

Page 39: 머피의 머신러닝 13 Sparse Linear Model

13.5.3 Elastic net (ridge and lasso combined)

• When there are many strongly correlated features, lasso tends to pick just one of them, essentially arbitrarily.

• In the D>N case, lasso can select at most N variables before it saturates.

• If N > D but the variables are correlated, it has been empirically observed that the prediction performance of ridge is better than that of lasso.

• Grouping effect: highly correlated features tend to receive the same weight (whereas lasso picks one of them).
• For example, if two features are equal, so X:,j = X:,k, one can show that their estimates are also equal, ŵj = ŵk.
• By contrast, with lasso we may have ŵj = 0 and ŵk ≠ 0, or vice versa.

• So the elastic net seems to have the advantage of keeping features that lasso would drop merely because of strong correlation, while still filtering out features unrelated to the response.
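A tiny scikit-learn illustration of the grouping effect on two identical columns; this is an illustrative aside with arbitrary synthetic data and parameter values, not an experiment from the chapter.

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
N = 100
x = rng.standard_normal(N)
X = np.column_stack([x, x, rng.standard_normal(N)])   # columns 0 and 1 identical
y = 3.0 * x + 0.1 * rng.standard_normal(N)

print("lasso       coef_:", Lasso(alpha=0.1).fit(X, y).coef_)
print("elastic net coef_:", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)
# The lasso tends to put the weight on just one of the duplicated columns,
# while the elastic net spreads it roughly evenly across both (grouping effect).
```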

Page 40: 머피의 머신러닝 13 Sparse Linear Model

Conclusion

Bayesian variable selection

The spike and slab model

Bernoulli-Gaussian model

l0 regularization

l1 regularization (lasso)

how to compute p(D|γ)

hard to optimize

wj are kept (irrelevant coefficients are not removed)