
CSE 446: Machine Learning
Emily Fox, University of Washington
March 1, 2017

Expectation Maximization cont’d

Recap of EM so far…


Part 1: What if we knew the cluster parameters {πk, μk, Σk}?


Responsibilities in equations

Responsibility cluster k takes for observation i, normalized over all possible cluster assignments.
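In symbols (the standard mixture-of-Gaussians form):

```latex
r_{ik} = \frac{\pi_k \, N(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, N(x_i \mid \mu_j, \Sigma_j)}
```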


Part 1 summary


Desired soft assignments (responsibilities) are easy to compute when the cluster parameters {πk, μk, Σk} are known

But we don’t know these!


Part 2b: What can we do with just soft assignments rik?


Estimating cluster parameters from soft assignments

Instead of having a full observation xi in cluster k, just allocate a portion rik: xi is divided across all clusters, as determined by rik.


Maximum likelihood estimation from soft assignments

Just like in boosting with weighted observations…


R      G      B       ri1     ri2     ri3
x1[1]  x1[2]  x1[3]   0.30    0.18    0.52   ← 52% chance this obs is in cluster 3
x2[1]  x2[2]  x2[3]   0.01    0.26    0.73
x3[1]  x3[2]  x3[3]   0.002   0.008   0.99
x4[1]  x4[2]  x4[3]   0.75    0.10    0.15
x5[1]  x5[2]  x5[3]   0.05    0.93    0.02
x6[1]  x6[2]  x6[3]   0.13    0.86    0.01

Total weight in cluster (effective # of obs):  1.242   2.338   2.42


Maximum likelihood estimation from soft assignments


R      G      B       Cluster 1 weights
x1[1]  x1[2]  x1[3]   0.30
x2[1]  x2[2]  x2[3]   0.01
x3[1]  x3[2]  x3[3]   0.002
x4[1]  x4[2]  x4[3]   0.75
x5[1]  x5[2]  x5[3]   0.05
x6[1]  x6[2]  x6[3]   0.13

R      G      B       Cluster 2 weights
x1[1]  x1[2]  x1[3]   0.18
x2[1]  x2[2]  x2[3]   0.26
x3[1]  x3[2]  x3[3]   0.008
x4[1]  x4[2]  x4[3]   0.10
x5[1]  x5[2]  x5[3]   0.93
x6[1]  x6[2]  x6[3]   0.86

R      G      B       Cluster 3 weights
x1[1]  x1[2]  x1[3]   0.52
x2[1]  x2[2]  x2[3]   0.73
x3[1]  x3[2]  x3[3]   0.99
x4[1]  x4[2]  x4[3]   0.15
x5[1]  x5[2]  x5[3]   0.02
x6[1]  x6[2]  x6[3]   0.01


Cluster-specific location/shape MLE


Compute cluster parameter estimates with weights on each row.


Total weight in cluster k = effective # of obs.
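In symbols, the weighted (responsibility-based) MLEs are:

```latex
N_k = \sum_{i=1}^{N} r_{ik}, \qquad
\hat{\mu}_k = \frac{1}{N_k} \sum_{i=1}^{N} r_{ik}\, x_i, \qquad
\hat{\Sigma}_k = \frac{1}{N_k} \sum_{i=1}^{N} r_{ik}\, (x_i - \hat{\mu}_k)(x_i - \hat{\mu}_k)^\top
```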


MLE of cluster proportions


Estimate cluster proportions from relative weights

ri1     ri2     ri3
0.30    0.18    0.52
0.01    0.26    0.73
0.002   0.008   0.99
0.75    0.10    0.15
0.05    0.93    0.02
0.13    0.86    0.01

Total weight in cluster:  1.242   2.338   2.42   (total weight in cluster k = effective # obs)
Total weight in dataset:  6   (# datapoints N)
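In symbols, and with the column totals from the table above:

```latex
\hat{\pi}_k = \frac{N_k}{N}: \qquad
\hat{\pi}_1 = \frac{1.242}{6} \approx 0.207, \quad
\hat{\pi}_2 = \frac{2.338}{6} \approx 0.390, \quad
\hat{\pi}_3 = \frac{2.42}{6} \approx 0.403
```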


Part 2b summary

Still straightforward to compute cluster parameter estimates from soft assignments


Expectation maximization (EM)


Expectation maximization (EM): An iterative algorithm

Together, Parts 1 and 2b motivate an iterative algorithm (a code sketch follows the two steps):

1. E-step: estimate cluster responsibilities given current parameter estimates

2. M-step: maximize likelihood over parameters given current responsibilities
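A minimal sketch of this loop for a mixture of Gaussians, assuming NumPy and SciPy are available; em_gmm, its random initialization, and the fixed iteration count are illustrative choices, not the course's reference implementation:

```python
# Minimal EM sketch for a mixture of Gaussians (illustrative only).
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                      # uniform initial proportions
    mu = X[rng.choice(N, size=K, replace=False)]  # K random points as centers
    Sigma = np.stack([np.eye(d)] * K)             # identity covariances

    for _ in range(n_iters):
        # E-step: r[i, k] proportional to pi_k * N(x_i | mu_k, Sigma_k)
        r = np.column_stack([
            pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        r /= r.sum(axis=1, keepdims=True)         # normalize over clusters

        # M-step: weighted MLE given the responsibilities
        Nk = r.sum(axis=0)                        # effective # obs per cluster
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma, r
```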


EM for mixtures of Gaussians in pictures (the figures themselves are not reproduced here):
- initialization
- after 1st iteration
- after 2nd iteration
- converged solution
- replay

The nitty gritty of EM


Convergence and initialization of EM


Convergence of EM

• EM is a coordinate-ascent algorithm
  - Can equate the E- and M-steps with alternating maximizations of an objective function

• Converges to a local mode

• We assess convergence via the (log) likelihood of the data under the current parameter and responsibility estimates
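In symbols, the monitored quantity is the data log likelihood under the current parameters:

```latex
\ell(\pi, \mu, \Sigma) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, N(x_i \mid \mu_k, \Sigma_k)
```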


Initialization

• Many ways to initialize the EM algorithm

• Important for convergence rates & quality of local mode found

• Examples:
  - Choose K observations at random to define K “centroids”. Assign other observations to the nearest centroid to form initial parameter estimates.
  - Pick centers sequentially to provide good coverage of the data, as in k-means++
  - Initialize from a k-means solution (see the sketch below)
  - Grow the mixture model by splitting (and sometimes removing) clusters until K clusters are formed
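For instance, a sketch of the k-means initialization, assuming scikit-learn's KMeans; the function name and the tiny diagonal term are illustrative:

```python
# Initialize mixture-of-Gaussians parameters from a k-means clustering.
import numpy as np
from sklearn.cluster import KMeans

def init_from_kmeans(X, K, seed=0):
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X)
    pi = np.array([np.mean(labels == k) for k in range(K)])         # proportions
    mu = np.array([X[labels == k].mean(axis=0) for k in range(K)])  # means
    # Per-cluster covariance, with a tiny ridge so no variance starts at 0
    Sigma = np.array([
        np.cov(X[labels == k], rowvar=False) + 1e-6 * np.eye(X.shape[1])
        for k in range(K)
    ])
    return pi, mu, Sigma
```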


Potential of vanilla EM to overfit


Overfitting of MLE

Maximizing likelihood can overfit to data

Imagine a K=2 example with one observation assigned to cluster 1 and the others assigned to cluster 2.
- What parameter values maximize the likelihood?

Set the center equal to that point and shrink the variance to 0. The likelihood goes to ∞!


Overfitting in high dims

Doc-clustering example: imagine only 1 doc assigned to cluster k has word w (or all docs in the cluster agree on the count of word w).

The likelihood is maximized by setting μk[w] = xi[w] and σ²w,k = 0.

Then the likelihood of any doc with a different count of word w being in cluster k is 0!


Simple regularization of M-step for mixtures of Gaussians

Simple fix: don’t let variances go to 0!

Add a small amount to the diagonal of the covariance estimate (see the sketch below).

Alternatively, take a Bayesian approach and place a prior on the parameters. Similar idea, but all parameter estimates are “smoothed” via cluster pseudo-observations.
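A minimal sketch of the diagonal fix inside the M-step; eps and the helper name are illustrative, not from the slides:

```python
import numpy as np

def regularized_cov(X, r_k, mu_k, eps=1e-6):
    """Responsibility-weighted covariance for cluster k, with a small
    amount added to the diagonal so variances cannot collapse to 0."""
    Nk = r_k.sum()                                # effective # obs in cluster k
    diff = X - mu_k
    Sigma_k = (r_k[:, None] * diff).T @ diff / Nk
    return Sigma_k + eps * np.eye(X.shape[1])
```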

Formally relating k-means and EM for mixtures of Gaussians


Relationship to k-means

Consider a Gaussian mixture model with

Σk = σ²I   (diagonal covariance with σ² in every entry: spherically symmetric clusters)

and let the variance parameter σ² → 0. Each datapoint then gets fully assigned to the nearest center, just as in k-means:

- Spherical clusters with equal variances, so relative likelihoods are just a function of distance to the cluster center
- As the variances → 0, the likelihood ratio becomes 0 or 1
- Responsibilities weigh in the cluster proportions, but are dominated by the likelihood disparity
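Concretely, with Σk = σ²I the normalization constants cancel and the responsibilities reduce to hard assignments in the limit:

```latex
r_{ik} = \frac{\pi_k \exp\!\left(-\tfrac{1}{2\sigma^2}\lVert x_i - \mu_k\rVert_2^2\right)}
              {\sum_{j=1}^{K} \pi_j \exp\!\left(-\tfrac{1}{2\sigma^2}\lVert x_i - \mu_j\rVert_2^2\right)}
\;\longrightarrow\;
\begin{cases} 1 & k = \arg\min_j \lVert x_i - \mu_j\rVert_2^2 \\ 0 & \text{otherwise} \end{cases}
\quad \text{as } \sigma^2 \to 0
```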


Infinitesimally small variance: EM = k-means

1. E-step: estimate cluster responsibilities given current parameter estimates (with infinitesimally small variance, the decision is based on distance to the nearest cluster center)

2. M-step: maximize likelihood over parameters given current responsibilities (hard assignments!)


Summary for the EM algorithm


What you can do now…

• Estimate soft assignments (responsibilities) given mixture model parameters

• Solve maximum likelihood parameter estimation using soft assignments (weighted data)

• Implement an EM algorithm for inferring soft assignments and cluster parameters
  - Determine an initialization strategy
  - Implement a variant that helps avoid overfitting issues

• Compare and contrast with k-means
  - Soft vs. hard assignments
  - k-means as a limiting special case of EM for mixtures of Gaussians


CSE 446: Machine Learning
Emily Fox, University of Washington
March 1, 2017

Bayes Optimal Classifier & Naïve Bayes


Classification

Learn f: X → Y
- X – features
- Y – target classes

Suppose you know P(Y|X) exactly; how should you classify?
- Bayes optimal classifier:
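In symbols, the Bayes optimal classifier predicts the most probable class:

```latex
\hat{y} = \arg\max_{y} P(Y = y \mid X = x)
```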


Recall: Bayes rule
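In symbols:

```latex
P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}
```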

Which is shorthand for:
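```latex
P(Y = y \mid X = x) = \frac{P(X = x \mid Y = y)\, P(Y = y)}{P(X = x)}
```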


How hard is it to learn the optimal classifier?

• Data = {(xi, yi)}, i = 1, …, N (see the table below)

• How do we represent these? How many parameters?
  - Prior, P(Y): suppose Y is composed of k classes → k − 1 parameters
  - Likelihood, P(X|Y): suppose X is composed of d binary features → (2^d − 1)·k parameters

• Complex model! High variance with limited data!!!

Sky    Temp  Humid   Wind    Water  Forecast  EnjoySpt
Sunny  Warm  Normal  Strong  Warm   Same      Yes
Sunny  Warm  High    Strong  Warm   Same      Yes
Rainy  Cold  High    Strong  Warm   Change    No
Sunny  Warm  High    Strong  Cold   Change    Yes
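For scale, a worked example (numbers not from the slide): with d = 30 binary features and k = 2 classes, the full likelihood table alone needs

```latex
(2^{30} - 1) \cdot 2 \approx 2.1 \times 10^{9}
```

parameters, which is why the unrestricted model has high variance with limited data.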
