
Cluster Analysis

Debashis Ghosh, Department of Statistics, Penn State University

(based on slides from Jia Li, Dept. of Statistics)

Summer School in Statistics for Astronomers, June 2010


Clustering: Intuition

• A basic tool in data mining/pattern recognition:

– Divide a set of data into groups.

– Samples in one cluster are close and clusters are far apart.

[Scatter plot of example data illustrating well-separated clusters]

• Motivations:

– Discover classes of data in an unsupervised way (unsupervised learning).

– Efficient representation of data: fast retrieval, data complexity reduction.

– Various engineering purposes: tightly linked with pattern recognition.


Astronomical Application of Clustering: Multivariate clustering of gamma-ray bursts

• GRBs are extremely rapid (1–100 s), powerful (L ∼ 10^51 erg), explosive events occurring at random times in normal galaxies. The GRB itself is followed by an afterglow seen at longer wavelengths (X-ray to radio) for days/months. They probably arise from a sub-class of supernova explosions, colliding binary neutron stars, or a similar event which produces a collimated relativistic fireball. Due to Doppler beaming, we probably see only a small fraction of all GRBs.

• Problem: Based on the properties of the GRB (e.g., location in the sky, arrival time, duration, fluence, and spectral hardness), can we find subtypes/multiple classes of events?


Approaches to Clustering

• Represent samples by feature vectors.

• Define a distance measure to assess the closeness between data points.

• “Closeness” can be measured in many ways.

– Define distance based on various norms.

– For stars with measured parallax, the multivariate “distance” between stars is the spatial Euclidean distance. For a galaxy redshift survey, however, the multivariate “distance” depends on the Hubble constant which scales velocity to spatial distance. For many astronomical datasets, the variables have incompatible units and no prior known relationship. The result of clustering will depend on the arbitrary choice of variable scaling.


Approaches to Clustering

• Clustering: grouping of similar objects (unsupervised learning)

• Approaches

– Prototype methods:

∗ K-means (for vectors)
∗ K-center (for vectors)
∗ D2-clustering (for bags of weighted vectors)

– Statistical modeling

∗ Mixture modeling by the EM algorithm
∗ Modal clustering

– Pairwise distance based partition:

∗ Spectral graph partitioning
∗ Dendrogram clustering (agglomerative): single linkage (friends-of-friends algorithm), complete linkage, etc.


Clustering versus Classification

• Recall the goal of clustering: find subtypes or groups that are not defined a priori, based on measurements.

• By contrast, if there are a priori group labels and one wishes to use them in an analysis, then clustering is not the method to use!!

• Instead one uses classification methodologies.

In computer science, clustering is termed unsupervised learning, while classification is termed supervised learning.


Philosophies of Clustering

• Parametric versus nonparametric

• Probabilistic versus algorithmic

• Pros and cons of each approach

• With any method, one must realize that a notion of distance is used.


K-means

• Assume there are M prototypes (observations) denoted by

Z = {z_1, z_2, ..., z_M}.

• Each training sample is assigned to one of the prototypes. Denote the assignment function by A(·). Then A(x_i) = j means the ith training sample is assigned to the jth prototype.

• Goal: minimize the total mean squared error between the training samples and their representative prototypes, that is, the trace of the pooled within-cluster covariance matrix:

\arg\min_{Z, A} \sum_{i=1}^{N} \| x_i − z_{A(x_i)} \|^2

• Denote the objective function by

L(Z, A) = \sum_{i=1}^{N} \| x_i − z_{A(x_i)} \|^2 .

• Intuition: training samples are tightly clustered around the prototypes. Hence, the prototypes serve as a compact representation for the training data.


Necessary Conditions

• If Z is fixed, the optimal assignment function A(·) should follow the nearest-neighbor rule, that is,

A(x_i) = \arg\min_{j \in \{1, 2, ..., M\}} \| x_i − z_j \| .

• If A(·) is fixed, the prototype z_j should be the average (centroid) of all the samples assigned to the jth prototype:

z_j = \frac{\sum_{i: A(x_i) = j} x_i}{N_j},

where N_j is the number of samples assigned to prototype j.


The Algorithm

• Based on the necessary conditions, the k-means algorithm alternates between the two steps:

– For a fixed set of centroids (prototypes), optimize A(·) by assigning each sample to its closest centroid using Euclidean distance.

– Update each centroid by computing the average of all the samples assigned to it.

• The algorithm converges since the objective function is non-increasing after each iteration.

• Usually converges fast.

• Stopping criterion: the ratio of the decrease to the current objective function value falls below a threshold.
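The two alternating steps translate almost directly into code. The following is a minimal sketch (not from the slides), assuming NumPy and an (N, d) data array; the optional Z0 argument supplies initial prototypes, and the stopping rule is the relative-decrease criterion above.

import numpy as np

def kmeans(X, M, Z0=None, n_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch: X is an (N, d) array, M the number of prototypes."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    Z = (X[rng.choice(len(X), size=M, replace=False)].copy()
         if Z0 is None else np.asarray(Z0, dtype=float).copy())
    prev_obj = np.inf
    for _ in range(n_iter):
        # Assignment step: send each sample to its nearest prototype (Euclidean distance).
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        A = d2.argmin(axis=1)
        obj = d2[np.arange(len(X)), A].sum()            # L(Z, A) at the current assignment
        # Update step: each prototype becomes the centroid of the samples assigned to it.
        for j in range(M):
            if np.any(A == j):
                Z[j] = X[A == j].mean(axis=0)
        # Stop when the relative decrease of the objective falls below the threshold.
        if (prev_obj - obj) / max(obj, 1e-12) < tol:
            break
        prev_obj = obj
    return Z, A, obj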


Example

• Training set: {1.2, 5.6, 3.7, 0.6, 0.1, 2.6}.

• Apply the k-means algorithm with 2 centroids, z_1, z_2.

• Initialization: randomly pick z_1 = 2, z_2 = 5.

fixed                      update
z_1 = 2                    {1.2, 0.6, 0.1, 2.6}
z_2 = 5                    {5.6, 3.7}
{1.2, 0.6, 0.1, 2.6}       z_1 = 1.125
{5.6, 3.7}                 z_2 = 4.65
z_1 = 1.125                {1.2, 0.6, 0.1, 2.6}
z_2 = 4.65                 {5.6, 3.7}

The two prototypes are: z_1 = 1.125, z_2 = 4.65. The objective function is L(Z, A) = 5.3125.


• Initialization: randomly pick z_1 = 0.8, z_2 = 3.8.

fixed                      update
z_1 = 0.8                  {1.2, 0.6, 0.1}
z_2 = 3.8                  {5.6, 3.7, 2.6}
{1.2, 0.6, 0.1}            z_1 = 0.633
{5.6, 3.7, 2.6}            z_2 = 3.967
z_1 = 0.633                {1.2, 0.6, 0.1}
z_2 = 3.967                {5.6, 3.7, 2.6}

The two prototypes are: z_1 = 0.633, z_2 = 3.967. The objective function is L(Z, A) = 5.2133.

• Starting from different initial values, the k-means algorithm converges to different local optima.

• It can be shown that z_1 = 0.633, z_2 = 3.967 is the globally optimal solution.
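As an illustration, the kmeans sketch given earlier (an assumption of these notes, not code from the slides) reproduces the second initialization:

import numpy as np

X = np.array([[1.2], [5.6], [3.7], [0.6], [0.1], [2.6]])
Z, A, obj = kmeans(X, M=2, Z0=[[0.8], [3.8]])
print(Z.ravel(), obj)   # expect prototypes near 0.633 and 3.967, objective near 5.2133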


Initialization

• Randomly pick the prototypes to start the k-means iteration.

• Different initial prototypes may lead to different locally optimal solutions given by k-means.

• Try different sets of initial prototypes and compare the objective function at the end to choose the best solution.

• When randomly selecting initial prototypes, make sure that no prototype is outside the range of the entire data set.

• Initialization in the above simulation:

– Generated M random vectors with independent dimensions. For each dimension, the feature is uniformly distributed in [−1, 1].

– Linearly transform the jth feature, Z_j, j = 1, 2, ..., p, in each prototype (a vector) by Z_j s_j + m_j, where s_j is the sample standard deviation of dimension j and m_j is the sample mean of dimension j, both computed using the training data.
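A small sketch of this initialization scheme (assuming NumPy; the function name is illustrative):

import numpy as np

def init_prototypes(X, M, seed=0):
    """Draw M prototypes uniformly in [-1, 1]^p, then rescale by the data's
    per-dimension sample standard deviation and shift by its sample mean."""
    rng = np.random.default_rng(seed)
    m = X.mean(axis=0)                 # m_j: sample mean of dimension j
    s = X.std(axis=0, ddof=1)          # s_j: sample standard deviation of dimension j
    U = rng.uniform(-1.0, 1.0, size=(M, X.shape[1]))
    return U * s + m                   # Z_j * s_j + m_j, dimension by dimension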


Tree-structured Clustering

• Studied extensively in vector quantization from the perspective of data compression.

• Referred to as tree-structured vector quantization (TSVQ).

• The algorithm

1. Apply 2-centroid k-means to the entire data set.

2. The data are assigned to the 2 centroids.

3. For the data assigned to each centroid, apply 2-centroid k-means to them separately.

4. Repeat the above step.
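A recursive sketch of this splitting scheme, reusing the kmeans function sketched earlier (illustrative, not code from the slides):

def tsvq(X, depth):
    """Tree-structured clustering: recursively split the data with 2-centroid k-means."""
    if depth == 0 or len(X) < 2:
        return [X]                                     # leaf node: one cluster
    _, A, _ = kmeans(X, M=2)                           # split the current node in two
    return tsvq(X[A == 0], depth - 1) + tsvq(X[A == 1], depth - 1)

Calling tsvq(X, depth=3), for example, yields up to 2^3 = 8 leaf clusters.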


Comments on tree-structured clustering

• It is structurally more constrained. But on the other hand, it provides more insight into the patterns in the data.

• It is greedy in the sense of optimizing sequentially at each step. An early bad decision will propagate its effect.

• It provides more algorithmic flexibility.


K-center Clustering

• Let A be a set of n objects.

• Partition A into K sets C_1, C_2, ..., C_K.

• Cluster size of C_k: the least value D for which all points in C_k are:

1. within distance D of each other, or

2. within distance D/2 of some point called the cluster center.

• Let the cluster size of C_k be D_k.

• The cluster size of partition S is

D = \max_{k=1,...,K} D_k .

• Goal: Given K, \min_S D(S).
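The slides do not prescribe an algorithm for this objective. One common heuristic (a swapped-in technique, not covered here) is the greedy farthest-point traversal: each new center is the object farthest from the centers chosen so far. A sketch, assuming NumPy and Euclidean distance:

import numpy as np

def kcenter_greedy(X, K, seed=0):
    """Greedy farthest-point heuristic for the k-center objective."""
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]              # start from a random object
    d = np.linalg.norm(X - X[centers[0]], axis=1)      # distance to the nearest chosen center
    for _ in range(K - 1):
        centers.append(int(d.argmax()))                # farthest object becomes a new center
        d = np.minimum(d, np.linalg.norm(X - X[centers[-1]], axis=1))
    labels = np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2).argmin(axis=1)
    return X[centers], labels, 2 * d.max()             # 2 * d.max() bounds the cluster size D (definition 2)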


Comparison with k-means

• Assume the distance between vectors is the squared Euclidean distance.

• K-means:

\min_S \sum_{k=1}^{K} \sum_{i: x_i \in C_k} (x_i − µ_k)^T (x_i − µ_k)

where µ_k is the centroid for cluster C_k. In particular,

µ_k = \frac{1}{N_k} \sum_{i: x_i \in C_k} x_i .

• K-center:

\min_S \max_{k=1,...,K} \max_{i: x_i \in C_k} (x_i − µ_k)^T (x_i − µ_k) ,

where µ_k is called the “centroid”, but may not be the mean vector.

• Another formulation of k-center:

\min_S \max_{k=1,...,K} \max_{i,j: x_i, x_j \in C_k} L(x_i, x_j) ,

where L(x_i, x_j) denotes any distance between a pair of objects.


Figure 1: Comparison of k-means and k-center. (a): Original unclustered data. (b): Clustering by k-means. (c): Clustering by k-center. K-means focuses on average distance; k-center focuses on the worst scenario.


Agglomerative Clustering

• Generate clusters in a hierarchical way.

• Let the data set be A = {x_1, ..., x_n}.

• Start with n clusters, each containing one data point.

• Merge the two clusters with minimum pairwise distance.

• Update between-cluster distance.

• Iterate the merging procedure.

• The clustering procedure can be visualized by a tree structure called a dendrogram.

• How is the between-cluster distance defined?

– For clusters containing only one data point, the between-cluster distance is the between-object distance.

– For clusters containing multiple data points, the between-cluster distance is an agglomerative version of the between-object distances.

∗ Examples: minimum or maximum between-object distances for objects in the two clusters.


– The agglomerative between-cluster distance can often be computed recursively.


Example Distances

• Suppose clusters r and s are two clusters merged into a new cluster t. Let k be any other cluster.

• Denote the between-cluster distance by D(·, ·).

• How do we get D(t, k) from D(r, k) and D(s, k)?

– Single-link clustering:

D(t, k) = min(D(r, k), D(s, k))

D(t, k) is the minimum distance between two objects in clusters t and k, respectively.

– Complete-link clustering:

D(t, k) = max(D(r, k), D(s, k))

D(t, k) is the maximum distance between two objects in clusters t and k, respectively.

– Average linkage clustering:

Unweighted case:

D(t, k) = \frac{n_r}{n_r + n_s} D(r, k) + \frac{n_s}{n_r + n_s} D(s, k)

Weighted case:

D(t, k) = \frac{1}{2} D(r, k) + \frac{1}{2} D(s, k)


D(t, k) is the average distance between two objects in clusters t and k, respectively. For the unweighted case, the number of elements in each cluster is taken into consideration, while in the weighted case each cluster is weighted equally. So objects in a smaller cluster are weighted more heavily than those in larger clusters.

– Centroid clustering:

Unweighted case:

D(t, k) = \frac{n_r}{n_r + n_s} D(r, k) + \frac{n_s}{n_r + n_s} D(s, k) − \frac{n_r n_s}{(n_r + n_s)^2} D(r, s)

Weighted case:

D(t, k) = \frac{1}{2} D(r, k) + \frac{1}{2} D(s, k) − \frac{1}{4} D(r, s)

A centroid is computed for each cluster and the distance between clusters is given by the distance between their respective centroids.


– Ward’s clustering:

D(t, k) = \frac{n_r + n_k}{n_r + n_s + n_k} D(r, k) + \frac{n_s + n_k}{n_r + n_s + n_k} D(s, k) − \frac{n_k}{n_r + n_s + n_k} D(r, s)

Merge the two clusters for which the change in the variance of the clustering is minimized. The variance of a cluster is defined as the sum of squared errors between each object in the cluster and the centroid of the cluster.

• The dendrogram generated by single-link clustering tends to look like a chain. Clusters generated by complete-link may not be well separated. Other methods are intermediates between the two.
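All of these linkage rules are implemented in standard software. A brief sketch using SciPy's hierarchical clustering routines (an illustration; the data here are random stand-ins):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(100, 2))       # stand-in for a real data set
# method may be 'single', 'complete', 'average', 'weighted', 'centroid', or 'ward'
Zlink = linkage(X, method='ward')                         # (n-1) x 4 merge table encoding the dendrogram
labels = fcluster(Zlink, t=9, criterion='maxclust')       # cut the tree into 9 clusters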


Figure 2: Agglomerative clustering of a data set (100 points) into 9 clusters. (a): Single-link, (b): Complete-link, (c): Average linkage, (d): Ward’s clustering.


Hipparcos Data

• Clustering based on log L and BV.

[Scatter plots of log L versus BV for the Hipparcos data: (a) K-center, #clusters=4; (b) K-means, #clusters=4; (c) EM, #clusters=4; (d) EM, #clusters=3.]

Figure 3: Clustering of the Hipparcos data


[Scatter plots of log L versus BV for the Hipparcos data: (a) Single linkage, #clusters=20; (b) Complete linkage, #clusters=10; (c) Average linkage, #clusters=10; (d) Ward’s linkage, #clusters=10.]

Figure 4: Clustering of the Hipparcos data


[Scatter plots of log L versus BV for the Hipparcos data: (a) Single linkage, #clusters=4; (b) Complete linkage, #clusters=4; (c) Average linkage, #clusters=4; (d) Ward’s linkage, #clusters=4.]

Figure 5: Clustering of the Hipparcos data


Mixture Model-based Clustering

• Each cluster is mathematically represented by a parametric distribution. Examples: Gaussian (continuous), Poisson (discrete).

• The entire data set is modeled by a mixture of these distributions.

• An individual distribution used to model a specific cluster is often referred to as a component distribution.

• Suppose there are K components (clusters). Each component is a Gaussian distribution parameterized by µ_k, Σ_k. Denote the data by X, X ∈ R^d. The density of component k is

f_k(x) = φ(x | µ_k, Σ_k) = \frac{1}{(2π)^{d/2} |Σ_k|^{1/2}} \exp\left( −\frac{(x − µ_k)^t Σ_k^{−1} (x − µ_k)}{2} \right) .

• The prior probability (weight) of component k is a_k. The mixture density is:

f(x) = \sum_{k=1}^{K} a_k f_k(x) = \sum_{k=1}^{K} a_k φ(x | µ_k, Σ_k) .
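A short sketch (assuming SciPy) that evaluates this mixture density from given parameters:

import numpy as np
from scipy.stats import multivariate_normal

def mixture_density(x, a, mus, Sigmas):
    """f(x) = sum_k a_k * phi(x | mu_k, Sigma_k) for a K-component Gaussian mixture."""
    return sum(a_k * multivariate_normal.pdf(x, mean=mu_k, cov=S_k)
               for a_k, mu_k, S_k in zip(a, mus, Sigmas))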


Advantages

• A mixture model with high likelihood tends to have the following traits:

– Component distributions have high “peaks” (data in one cluster are tight).

– The mixture model “covers” the data well (dominant patterns in the data are captured by component distributions).

• Advantages

– Well-studied statistical inference techniques available.

– Flexibility in choosing the component distributions.

– Obtain a density estimate for each cluster.

– A “soft” classification is available.


[Plot: density function of two clusters]

EM Algorithm

• The parameters are estimated by the maximum likelihood (ML) criterion using the EM algorithm.

• The EM algorithm provides an iterative computation of maximum likelihood estimation when the observed data are incomplete.

• Incompleteness can be conceptual.

– We need to estimate the distribution of X, in sample space 𝒳, but we can only observe X indirectly through Y, in sample space 𝒴.

– In many cases, there is a mapping x → y(x) from 𝒳 to 𝒴, and x is only known to lie in a subset of 𝒳, denoted by 𝒳(y), which is determined by the equation y = y(x).


– The distribution of X is parameterized by a family of distributions f(x | θ), with parameters θ ∈ Ω, on x. The distribution of y, g(y | θ), is

g(y | θ) = \int_{𝒳(y)} f(x | θ) \, dx .

• The EM algorithm aims at finding a θ that maximizes g(y | θ) given an observed y.

• Introduce the function

Q(θ′ | θ) = E(\log f(x | θ′) | y, θ) ,

that is, the expected value of log f(x | θ′) according to the conditional distribution of x given y and parameter θ. The expectation is assumed to exist for all pairs (θ′, θ). In particular, it is assumed that f(x | θ) > 0 for θ ∈ Ω.

• EM Iteration:

– E-step: Compute Q(θ | θ^{(p)}).

– M-step: Choose θ^{(p+1)} to be a value of θ ∈ Ω that maximizes Q(θ | θ^{(p)}).


EM for the Mixture of Normals

• Observed data (incomplete): x_1, x_2, ..., x_n, where n is the sample size. Denote all the samples collectively by x.

• Complete data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where y_i is the cluster (component) identity of sample x_i.

• The collection of parameters, θ, includes: a_k, µ_k, Σ_k, k = 1, 2, ..., K.

• The likelihood function is:

L(x | θ) = \sum_{i=1}^{n} \log \left( \sum_{k=1}^{K} a_k φ(x_i | µ_k, Σ_k) \right) .

• L(x | θ) is the objective function of the EM algorithm (maximize). Numerical difficulty comes from the sum inside the log.


• The Q function is:

Q(θ′ | θ) = E\left[ \log \prod_{i=1}^{n} a′_{y_i} φ(x_i | µ′_{y_i}, Σ′_{y_i}) \;\Big|\; x, θ \right]

= E\left[ \sum_{i=1}^{n} \left( \log a′_{y_i} + \log φ(x_i | µ′_{y_i}, Σ′_{y_i}) \right) \;\Big|\; x, θ \right]

= \sum_{i=1}^{n} E\left[ \log a′_{y_i} + \log φ(x_i | µ′_{y_i}, Σ′_{y_i}) \;\big|\; x_i, θ \right] .

The last equality comes from the fact that the samples are independent.

• Note that when x_i is given, only y_i is random in the complete data (x_i, y_i). Also, y_i only takes a finite number of values, i.e., cluster identities 1 to K. The distribution of Y given X = x_i is the posterior probability of Y given X.

• Denote the posterior probabilities of Y = k, k = 1, ..., K given x_i by p_{i,k}. By the Bayes formula, the posterior probabilities satisfy:

p_{i,k} ∝ a_k φ(x_i | µ_k, Σ_k),   \sum_{k=1}^{K} p_{i,k} = 1 .


• Then each summand in Q(θ′ | θ) is

E\left[ \log a′_{y_i} + \log φ(x_i | µ′_{y_i}, Σ′_{y_i}) \;\big|\; x_i, θ \right] = \sum_{k=1}^{K} p_{i,k} \log a′_k + \sum_{k=1}^{K} p_{i,k} \log φ(x_i | µ′_k, Σ′_k) .

• Note that we cannot see the direct effect of θ in the above equation, but the p_{i,k} are computed using θ, i.e., the current parameters. θ′ includes the updated parameters.

• We then have:

Q(θ′ | θ) = \sum_{i=1}^{n} \sum_{k=1}^{K} p_{i,k} \log a′_k + \sum_{i=1}^{n} \sum_{k=1}^{K} p_{i,k} \log φ(x_i | µ′_k, Σ′_k) .

• Note that the prior probabilities a′_k and the parameters of the Gaussian components µ′_k, Σ′_k can be optimized separately.


• The a′_k are subject to \sum_{k=1}^{K} a′_k = 1. Basic optimization theory shows that the a′_k are optimized by

a′_k = \frac{\sum_{i=1}^{n} p_{i,k}}{n} .

• The optimization of µ_k and Σ_k is simply a maximum likelihood estimation of the parameters using samples x_i with weights p_{i,k}. Basic optimization techniques also lead to

µ′_k = \frac{\sum_{i=1}^{n} p_{i,k} x_i}{\sum_{i=1}^{n} p_{i,k}},   Σ′_k = \frac{\sum_{i=1}^{n} p_{i,k} (x_i − µ′_k)(x_i − µ′_k)^t}{\sum_{i=1}^{n} p_{i,k}} .

• After every iteration, the likelihood function L is guaranteed not to decrease (though the increase may not be strict).

• We have derived the EM algorithm for a mixture of Gaussians.


EM Algorithm for the Mixture of Gaussians

Parameters estimated at the pth iteration are marked by a superscript (p).

1. Initialize parameters

2. E-step: Compute the posterior probabilities, for all i = 1, ..., n, k = 1, ..., K:

p_{i,k} = \frac{a_k^{(p)} φ(x_i | µ_k^{(p)}, Σ_k^{(p)})}{\sum_{k=1}^{K} a_k^{(p)} φ(x_i | µ_k^{(p)}, Σ_k^{(p)})} .

3. M-step:

a_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k}}{n}

µ_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k} x_i}{\sum_{i=1}^{n} p_{i,k}}

Σ_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k} (x_i − µ_k^{(p+1)})(x_i − µ_k^{(p+1)})^t}{\sum_{i=1}^{n} p_{i,k}}

4. Repeat steps 2 and 3 until convergence.

Comment: for mixtures of other distributions, the EM algorithm is very similar. The E-step involves computing the posterior probabilities; only the particular distribution φ needs to be changed. The M-step always involves parameter optimization. Formulas differ according to distributions.
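A compact NumPy/SciPy sketch of the algorithm above for the Gaussian case (illustrative; the small ridge added to each covariance is an assumption to keep Σ_k invertible, not part of the slides):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=200, tol=1e-6, seed=0):
    """EM for a Gaussian mixture, following the E-step and M-step formulas above."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)].astype(float)     # crude initialization
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    a = np.full(K, 1.0 / K)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probabilities p_{i,k}
        dens = np.column_stack([a[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        ll = np.log(dens.sum(axis=1)).sum()                    # log-likelihood L(x | theta)
        p = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update the weights, means, and covariances
        Nk = p.sum(axis=0)
        a = Nk / n
        mu = (p.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (p[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol:                                  # stop when the likelihood stalls
            break
        prev_ll = ll
    return a, mu, Sigma, p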


Computation Issues

• If a different Σ_k is allowed for each component, the likelihood function is not bounded. The global optimum is meaningless. (Don’t overdo it!)

• How to initialize? Example:

– Apply k-means first.

– Initialize µ_k and Σ_k using all the samples classified to cluster k.

– Initialize a_k by the proportion of data assigned to cluster k by k-means.

• In practice, we may want to reduce model complexity by putting constraints on the parameters. For instance, assume equal priors or identical covariance matrices for all the components.
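Both points, k-means-based initialization and constrained covariances, are available off the shelf. For example, with scikit-learn (an illustration, assuming the library is installed):

from sklearn.mixture import GaussianMixture

# init_params="kmeans" (the default) seeds the EM iterations from a k-means partition;
# covariance_type="tied" constrains all components to share a single covariance matrix.
gmm = GaussianMixture(n_components=3, covariance_type="tied", init_params="kmeans")
labels = gmm.fit_predict(X)      # X: the (n, d) data matrix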


Examples

• The heart disease data set is taken from the UCI machine learning database repository.

• There are 297 cases (samples) in the data set, of which 137 have heart disease. Each sample contains 13 quantitative variables, including cholesterol, max heart rate, etc.

• We remove the mean of each variable and normalize it to yield unit variance.

• The data are projected onto the plane spanned by the two most dominant principal component directions.

• A two-component Gaussian mixture is fit.
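A sketch of this pipeline with scikit-learn (variable names are illustrative; the data themselves are not reproduced here):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# X: (297, 13) array of the quantitative heart-disease variables
X_std = StandardScaler().fit_transform(X)          # remove the means, scale to unit variance
X_2d = PCA(n_components=2).fit_transform(X_std)    # two most dominant principal component directions
gmm = GaussianMixture(n_components=2).fit(X_2d)    # two-component Gaussian mixture
labels = gmm.predict(X_2d)                         # hard cluster assignments
posteriors = gmm.predict_proba(X_2d)               # "soft" classification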


Figure 6: The heart disease data set and the estimated cluster densities. Top: The scatter plot of the data. Bottom: The contour plot of the pdf estimated using a single-layer mixture of two normals. The thick lines are the boundaries between the two clusters based on the estimated pdfs of individual clusters.
