
Cluster Analysis

Debashis Ghosh
Department of Statistics
Penn State University

(based on slides from Jia Li, Dept. of Statistics)

Summer School in Statistics for Astronomers, June 2010


Clustering: Intuition

• A basic tool in data mining/pattern recognition:

– Divide a set of data into groups.

– Samples in one cluster are close and clusters are far apart.


• Motivations:

– Discover classes of data in an unsupervised way (unsupervised learning).

– Efficient representation of data: fast retrieval, data complexity reduction.

– Various engineering purposes: tightly linked with pattern recognition.


Astronomical Application of Clustering: Multivariate Clustering of Gamma-Ray Bursts

• GRBs are extremely rapid (1–100 s), powerful (L ∼ 10^51 erg), explosive events occurring at random times in normal galaxies. The GRB itself is followed by an afterglow seen at longer wavelengths (X-ray to radio) for days/months. They probably arise from a sub-class of supernova explosions, colliding binary neutron stars, or a similar event which produces a collimated relativistic fireball. Due to Doppler beaming, we probably see only a small fraction of all GRBs.

• Problem: Based on the properties of the GRB (e.g., location in the sky, arrival time, duration, fluence, and spectral hardness), can we find subtypes/multiple classes of events?


Approaches to Clustering

• Represent samples by feature vectors.

• Define a distance measure to assess the closeness between data.

• “Closeness” can be measured in many ways.

– Define distance based on various norms.

– For stars with measured parallax, the multivariate “distance” between stars is the spatial Euclidean distance. For a galaxy redshift survey, however, the multivariate “distance” depends on the Hubble constant which scales velocity to spatial distance. For many astronomical datasets, the variables have incompatible units and no prior known relationship. The result of clustering will depend on the arbitrary choice of variable scaling.
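Since the result of clustering depends on the choice of variable scaling, a common first step is to standardize each variable. A minimal sketch (not from the slides; the toy values are made up):

```python
# Hedged sketch: standardize variables before computing distances, since
# clustering results depend on variable scaling.
import numpy as np

def standardize(X):
    """Center each column and scale it to unit sample standard deviation."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def euclidean_distances(X):
    """Pairwise Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Example: two variables with very different units (hypothetical values).
X = np.array([[1.0, 1000.0], [2.0, 1100.0], [1.5, 5000.0]])
print(euclidean_distances(X))                # dominated by the second column
print(euclidean_distances(standardize(X)))   # both columns contribute comparably
```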


Approaches to Clustering

• Clustering: grouping of similar objects (unsupervised learning)

• Approaches

– Prototype methods:

∗ K-means (for vectors)
∗ K-center (for vectors)
∗ D2-clustering (for bags of weighted vectors)

– Statistical modeling:

∗ Mixture modeling by the EM algorithm
∗ Modal clustering

– Pairwise distance based partition:

∗ Spectral graph partitioning
∗ Dendrogram clustering (agglomerative): single linkage (friends-of-friends algorithm), complete linkage, etc.


Clustering versus Classification

• Recall the goal of clustering: find subtypes or groups that are not defined a priori based on measurements.

• By contrast, if there are a priori group labels and one wishes to use them in an analysis, then clustering is not the method to use!!

• Instead one uses classification methodologies.

In computer science, clustering is termed unsupervised learning, while classification is termed supervised learning.


Philosophies of Clustering

• Parametric versus nonparametric

• Probabilistic versus algorithmic

• Pros and cons of each approach

• With any method, one must realize that a notion of distance is used.


K-means

• Assume there are M prototypes (observations) denoted by

$$Z = \{z_1, z_2, ..., z_M\}.$$

• Each training sample is assigned to one of the prototypes. Denote the assignment function by A(·). Then A(x_i) = j means the ith training sample is assigned to the jth prototype.

• Goal: minimize the total mean squared error between the training samples and their representative prototypes, that is, the trace of the pooled within-cluster covariance matrix:

$$\arg\min_{Z, A} \sum_{i=1}^{N} \| x_i - z_{A(x_i)} \|^2$$

• Denote the objective function by

$$L(Z, A) = \sum_{i=1}^{N} \| x_i - z_{A(x_i)} \|^2 .$$

• Intuition: training samples are tightly clustered around the prototypes. Hence, the prototypes serve as a compact representation for the training data.


Necessary Conditions

• If Z is fixed, the optimal assignment function A(·) should follow the nearest neighbor rule, that is,

$$A(x_i) = \arg\min_{j \in \{1, 2, ..., M\}} \| x_i - z_j \| .$$

• If A(·) is fixed, the prototype z_j should be the average (centroid) of all the samples assigned to the jth prototype:

$$z_j = \frac{\sum_{i: A(x_i) = j} x_i}{N_j},$$

where N_j is the number of samples assigned to prototype j.


The Algorithm

• Based on the necessary conditions, the k-means algorithm alternates between the two steps:

– For a fixed set of centroids (prototypes), optimize A(·) by assigning each sample to its closest centroid using Euclidean distance.

– Update the centroids by computing the average of all the samples assigned to each of them.

• The algorithm converges since after each iteration the objective function decreases (is non-increasing).

• It usually converges fast.

• Stopping criterion: the ratio between the decrease and the objective function is below a threshold.
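A minimal NumPy sketch of this alternating iteration (the initialization and stopping rule below are simple choices, not prescribed by the slides):

```python
# Minimal k-means sketch: alternate assignment and centroid update.
import numpy as np

def kmeans(X, M, n_iter=100, tol=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # Initialize the M prototypes as randomly chosen data points (an assumption).
    Z = X[rng.choice(len(X), size=M, replace=False)]
    prev_obj = np.inf
    for _ in range(n_iter):
        # Assignment step: each sample goes to its closest prototype.
        d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
        A = d2.argmin(axis=1)
        obj = d2[np.arange(len(X)), A].sum()
        # Update step: each prototype becomes the mean of its assigned samples.
        for j in range(M):
            if np.any(A == j):
                Z[j] = X[A == j].mean(axis=0)
        # Stop when the relative decrease of the objective is small.
        if (prev_obj - obj) / max(obj, 1e-12) < tol:
            break
        prev_obj = obj
    return Z, A, obj
```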


Example

• Training set: {1.2, 5.6, 3.7, 0.6, 0.1, 2.6}.

• Apply the k-means algorithm with 2 centroids, z_1, z_2.

• Initialization: randomly pick z_1 = 2, z_2 = 5.

  fixed: z_1 = 2, z_2 = 5                    →  update: clusters {1.2, 0.6, 0.1, 2.6} and {5.6, 3.7}
  fixed: {1.2, 0.6, 0.1, 2.6}, {5.6, 3.7}    →  update: z_1 = 1.125, z_2 = 4.65
  fixed: z_1 = 1.125, z_2 = 4.65             →  update: clusters {1.2, 0.6, 0.1, 2.6} and {5.6, 3.7} (no change)

The two prototypes are z_1 = 1.125, z_2 = 4.65. The objective function is L(Z, A) = 5.3125.


• Initialization: randomly pick z_1 = 0.8, z_2 = 3.8.

  fixed: z_1 = 0.8, z_2 = 3.8                →  update: clusters {1.2, 0.6, 0.1} and {5.6, 3.7, 2.6}
  fixed: {1.2, 0.6, 0.1}, {5.6, 3.7, 2.6}    →  update: z_1 = 0.633, z_2 = 3.967
  fixed: z_1 = 0.633, z_2 = 3.967            →  update: clusters {1.2, 0.6, 0.1} and {5.6, 3.7, 2.6} (no change)

The two prototypes are z_1 = 0.633, z_2 = 3.967. The objective function is L(Z, A) = 5.2133.

• Starting from different initial values, the k-means algorithm converges to different local optima.

• It can be shown that z_1 = 0.633, z_2 = 3.967 is the globally optimal solution.
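The example above can be reproduced in a few lines; a sketch of the plain two-step iteration:

```python
# Reproduce the one-dimensional k-means example for both initializations.
import numpy as np

X = np.array([[1.2], [5.6], [3.7], [0.6], [0.1], [2.6]])

for init in ([2.0, 5.0], [0.8, 3.8]):
    Z = np.array(init).reshape(-1, 1)
    for _ in range(10):                              # a few iterations suffice here
        A = np.abs(X - Z.T).argmin(axis=1)           # assign to closest centroid
        Z = np.array([[X[A == j].mean()] for j in range(2)])
    obj = ((X - Z[A]) ** 2).sum()
    print(init, "->", Z.ravel(), "objective =", round(obj, 4))
# Expected (approximately): [2, 5]   -> [1.125, 4.65],  objective = 5.3125
#                           [0.8, 3.8] -> [0.633, 3.967], objective = 5.2133
```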


Initialization

• Randomly pick the prototypes to start the k-means iteration.

• Different initial prototypes may lead to different local optimal solutions given by k-means.

• Try different sets of initial prototypes and compare the objective function at the end to choose the best solution.

• When randomly selecting initial prototypes, make sure no prototype lies outside the range of the entire data set.

• Initialization in the above simulation:

– Generate M random vectors with independent dimensions. For each dimension, the feature is uniformly distributed in [−1, 1].

– Linearly transform the jth feature, Z_j, j = 1, 2, ..., p, in each prototype (a vector) by Z_j s_j + m_j, where s_j is the sample standard deviation of dimension j and m_j is the sample mean of dimension j, both computed using the training data.
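A sketch of this initialization scheme (the function name is illustrative):

```python
# Draw each prototype uniformly in [-1, 1] per dimension, then rescale by the
# sample standard deviation and shift by the sample mean of each dimension.
import numpy as np

def init_prototypes(X, M, rng=None):
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    m = X.mean(axis=0)              # sample mean of each dimension
    s = X.std(axis=0, ddof=1)       # sample standard deviation of each dimension
    Z = rng.uniform(-1.0, 1.0, size=(M, X.shape[1]))
    return Z * s + m                # Z_j * s_j + m_j, dimension by dimension
```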


Tree-structured Clustering

• Studied extensively in vector quantization from the perspective of data compression.

• Referred to as tree-structured vector quantization (TSVQ).

• The algorithm:

1. Apply 2-centroid k-means to the entire data set.

2. The data are assigned to the 2 centroids.

3. For the data assigned to each centroid, apply 2-centroid k-means to them separately.

4. Repeat the above step.
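A sketch of this recursive splitting, assuming scikit-learn's KMeans for the 2-centroid step and a user-chosen depth (both assumptions, not specified in the slides):

```python
# TSVQ-style clustering: recursively split the data with 2-centroid k-means.
import numpy as np
from sklearn.cluster import KMeans

def tsvq(X, depth):
    """Return a list of leaf clusters (index arrays into X)."""
    idx = np.arange(len(X))

    def split(indices, level):
        if level == depth or len(indices) < 2:
            return [indices]
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[indices])
        leaves = []
        for lab in (0, 1):
            leaves.extend(split(indices[labels == lab], level + 1))
        return leaves

    return split(idx, 0)
```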


Comments on tree-structured clustering

• It is structurally more constrained, but on the other hand it provides more insight into the patterns in the data.

• It is greedy in the sense of optimizing at each step sequentially. An early bad decision will propagate its effect.

• It provides more algorithmic flexibility.


K-center Clustering

• Let A be a set of n objects.

• Partition A into K sets C_1, C_2, ..., C_K.

• Cluster size of C_k: the least value D for which all points in C_k are:

1. within distance D of each other, or

2. within distance D/2 of some point called the cluster center.

• Let the cluster size of C_k be D_k.

• The cluster size of partition S is

$$D = \max_{k=1,...,K} D_k .$$

• Goal: given K, find min_S D(S).
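The slides define the objective but not an algorithm for optimizing it. One standard heuristic (not from the slides) is Gonzalez's farthest-point algorithm, which gives a 2-approximation to the k-center objective; a sketch:

```python
# Greedy farthest-point heuristic for k-center (a standard 2-approximation).
import numpy as np

def kcenter_greedy(X, K, rng=None):
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    centers = [int(rng.integers(len(X)))]          # start from an arbitrary point
    d = np.linalg.norm(X - X[centers[0]], axis=1)  # distance to nearest chosen center
    for _ in range(K - 1):
        nxt = int(d.argmax())                      # point farthest from all current centers
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    labels = np.array([
        int(np.argmin([np.linalg.norm(x - X[c]) for c in centers])) for x in X
    ])
    return X[centers], labels
```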


Comparison with k-means

• Assume the distance between vectors is the squared Euclidean distance.

• K-means:

$$\min_S \sum_{k=1}^{K} \sum_{i: x_i \in C_k} (x_i - \mu_k)^T (x_i - \mu_k),$$

where μ_k is the centroid for cluster C_k. In particular,

$$\mu_k = \frac{1}{N_k} \sum_{i: x_i \in C_k} x_i .$$

• K-center:

$$\min_S \max_{k=1,...,K} \max_{i: x_i \in C_k} (x_i - \mu_k)^T (x_i - \mu_k),$$

where μ_k is called the “centroid”, but may not be the mean vector.

• Another formulation of k-center:

$$\min_S \max_{k=1,...,K} \max_{i,j: x_i, x_j \in C_k} L(x_i, x_j),$$

where L(x_i, x_j) denotes any distance between a pair of objects.


Figure 1: Comparison of k-means and k-center. (a): Original unclustered data. (b): Clustering by k-means. (c): Clustering by k-center. K-means focuses on average distance; k-center focuses on the worst scenario.

Agglomerative Clustering

• Generate clusters in a hierarchical way.

• Let the data set be A = {x_1, ..., x_n}.

• Start with n clusters, each containing one data point.

• Merge the two clusters with minimum pairwise distance.

• Update the between-cluster distances.

• Iterate the merging procedure.

• The clustering procedure can be visualized by a tree structure called a dendrogram.

• How is the between-cluster distance defined?

– For clusters containing only one data point, the between-cluster distance is the between-object distance.

– For clusters containing multiple data points, the between-cluster distance is an agglomerative version of the between-object distances.

∗ Examples: minimum or maximum between-object distances for objects in the two clusters.

– The agglomerative between-cluster distance can often be computed recursively.
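In practice this procedure is available in standard libraries; a short sketch using SciPy's hierarchical clustering routines (the toy data are made up):

```python
# Agglomerative clustering with SciPy: build the merge tree, then cut it.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])  # toy data

Z = linkage(X, method="single")        # 'single', 'complete', 'average', 'ward', ...
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
# scipy.cluster.hierarchy.dendrogram(Z) can draw the merge tree (needs matplotlib).
```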


Example Distances

• Suppose clusters r and s are merged into a new cluster t. Let k be any other cluster.

• Denote the between-cluster distance by D(·, ·).

• How to get D(t, k) from D(r, k) and D(s, k)?

– Single-link clustering:

$$D(t, k) = \min(D(r, k), D(s, k))$$

D(t, k) is the minimum distance between two objects in clusters t and k respectively.

– Complete-link clustering:

$$D(t, k) = \max(D(r, k), D(s, k))$$

D(t, k) is the maximum distance between two objects in clusters t and k respectively.

– Average linkage clustering:

Unweighted case:

$$D(t, k) = \frac{n_r}{n_r + n_s} D(r, k) + \frac{n_s}{n_r + n_s} D(s, k)$$

Weighted case:

$$D(t, k) = \frac{1}{2} D(r, k) + \frac{1}{2} D(s, k)$$

D(t, k) is the average distance between two objects in clusters t and k respectively. For the unweighted case, the number of elements in each cluster is taken into consideration, while in the weighted case each cluster is weighted equally. So objects in a smaller cluster are weighted more heavily than those in larger clusters.

– Centroid clustering:

Unweighted case:

$$D(t, k) = \frac{n_r}{n_r + n_s} D(r, k) + \frac{n_s}{n_r + n_s} D(s, k) - \frac{n_r n_s}{(n_r + n_s)^2} D(r, s)$$

Weighted case:

$$D(t, k) = \frac{1}{2} D(r, k) + \frac{1}{2} D(s, k) - \frac{1}{4} D(r, s)$$

A centroid is computed for each cluster and the distance between clusters is given by the distance between their respective centroids.


– Ward’s clustering:

$$D(t, k) = \frac{n_r + n_k}{n_r + n_s + n_k} D(r, k) + \frac{n_s + n_k}{n_r + n_s + n_k} D(s, k) - \frac{n_k}{n_r + n_s + n_k} D(r, s)$$

Merge the two clusters for which the change in the variance of the clustering is minimized. The variance of a cluster is defined as the sum of squared error between each object in the cluster and the centroid of the cluster.

• The dendrogram generated by single-link clustering tends to look like a chain. Clusters generated by complete-link may not be well separated. Other methods are intermediates between the two.
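The recursive updates above are instances of the Lance–Williams family and can be written compactly; a sketch (for the centroid and Ward rules, D is usually taken to be squared Euclidean distance):

```python
# Recursive between-cluster distance updates as functions of D(r,k), D(s,k),
# D(r,s) and the cluster sizes n_r, n_s, n_k.
def update_distance(method, d_rk, d_sk, d_rs, n_r, n_s, n_k):
    n = n_r + n_s
    if method == "single":
        return min(d_rk, d_sk)
    if method == "complete":
        return max(d_rk, d_sk)
    if method == "average":                      # unweighted average linkage
        return (n_r * d_rk + n_s * d_sk) / n
    if method == "centroid":                     # unweighted centroid linkage
        return (n_r * d_rk + n_s * d_sk) / n - n_r * n_s * d_rs / n**2
    if method == "ward":
        tot = n + n_k
        return ((n_r + n_k) * d_rk + (n_s + n_k) * d_sk - n_k * d_rs) / tot
    raise ValueError(f"unknown method: {method}")
```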


Figure 2: Agglomerative clustering of a data set (100 points) into 9 clusters. (a): Single-link. (b): Complete-link. (c): Average linkage. (d): Ward’s clustering.

Hipparcos Data

• Clustering based on log L and BV.
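A sketch of how such a comparison might be run on the two columns; the data below are a synthetic stand-in for (log L, BV), since the actual Hipparcos file is not part of these notes:

```python
# Run k-means, a Gaussian mixture (EM), and Ward's linkage on a 2-column data set.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the (log L, BV) columns, for illustration only.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.5, 0.5], 0.2, (200, 2)),
               rng.normal([2.0, 1.5], 0.3, (200, 2))])

km_labels = KMeans(n_clusters=4, n_init=10).fit_predict(X)
em_labels = GaussianMixture(n_components=4).fit(X).predict(X)
hc_labels = fcluster(linkage(X, method="ward"), t=4, criterion="maxclust")
```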

Figure 3: Clustering of the Hipparcos data (log L vs. BV). (a): K-center, 4 clusters. (b): K-means, 4 clusters. (c): EM, 4 clusters. (d): EM, 3 clusters.

Figure 4: Clustering of the Hipparcos data. (a): Single linkage, 20 clusters. (b): Complete linkage, 10 clusters. (c): Average linkage, 10 clusters. (d): Ward’s linkage, 10 clusters.

Figure 5: Clustering of the Hipparcos data. (a): Single linkage, 4 clusters. (b): Complete linkage, 4 clusters. (c): Average linkage, 4 clusters. (d): Ward’s linkage, 4 clusters.

Mixture Model-based Clustering

• Each cluster is mathematically represented by a parametric distribution. Examples: Gaussian (continuous), Poisson (discrete).

• The entire data set is modeled by a mixture of these distributions.

• An individual distribution used to model a specific cluster is often referred to as a component distribution.

• Suppose there are K components (clusters). Each component is a Gaussian distribution parameterized by μ_k, Σ_k. Denote the data by X, X ∈ R^d. The density of component k is

$$f_k(x) = \phi(x \mid \mu_k, \Sigma_k) = \frac{1}{\sqrt{(2\pi)^d |\Sigma_k|}} \exp\left( -\frac{(x - \mu_k)^t \Sigma_k^{-1} (x - \mu_k)}{2} \right).$$

• The prior probability (weight) of component k is a_k. The mixture density is:

$$f(x) = \sum_{k=1}^{K} a_k f_k(x) = \sum_{k=1}^{K} a_k \phi(x \mid \mu_k, \Sigma_k).$$
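As a quick illustration, the mixture density above can be evaluated directly; a sketch with made-up parameter values:

```python
# Evaluate f(x) = sum_k a_k * phi(x | mu_k, Sigma_k) for a 2-component mixture in R^2.
import numpy as np
from scipy.stats import multivariate_normal

a = [0.6, 0.4]                                             # component weights a_k
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]          # component means mu_k
Sigma = [np.eye(2), np.array([[1.0, 0.5], [0.5, 2.0]])]    # covariances Sigma_k

def mixture_density(x):
    return sum(a_k * multivariate_normal.pdf(x, mean=m, cov=S)
               for a_k, m, S in zip(a, mu, Sigma))

print(mixture_density(np.array([1.0, 1.0])))
```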


Advantages

• A mixture model with high likelihood tends to have the following traits:

– Component distributions have high “peaks” (data in one cluster are tight).

– The mixture model “covers” the data well (dominant patterns in the data are captured by component distributions).

• Advantages:

– Well-studied statistical inference techniques are available.

– Flexibility in choosing the component distributions.

– A density estimate is obtained for each cluster.

– A “soft” classification is available.


[Plot: density function of two clusters]

EM Algorithm

• The parameters are estimated by the maximum likelihood (ML) criterion using the EM algorithm.

• The EM algorithm provides an iterative computation of maximum likelihood estimation when the observed data are incomplete.

• Incompleteness can be conceptual.

– We need to estimate the distribution of X, in sample space \mathcal{X}, but we can only observe X indirectly through Y, in sample space \mathcal{Y}.

– In many cases, there is a mapping x → y(x) from \mathcal{X} to \mathcal{Y}, and x is only known to lie in a subset of \mathcal{X}, denoted by \mathcal{X}(y), which is determined by the equation y = y(x).

– The distribution of X is parameterized by a family of distributions f(x | θ), with parameters θ ∈ Ω, on x. The distribution of y, g(y | θ), is

$$g(y \mid \theta) = \int_{\mathcal{X}(y)} f(x \mid \theta)\, dx .$$

• The EM algorithm aims at finding a θ that maximizes g(y | θ) given an observed y.

• Introduce the function

$$Q(\theta' \mid \theta) = E\left( \log f(x \mid \theta') \mid y, \theta \right),$$

that is, the expected value of log f(x | θ') according to the conditional distribution of x given y and parameter θ. The expectation is assumed to exist for all pairs (θ', θ). In particular, it is assumed that f(x | θ) > 0 for θ ∈ Ω.

• EM Iteration:

– E-step: Compute Q(θ | θ^(p)).

– M-step: Choose θ^(p+1) to be a value of θ ∈ Ω that maximizes Q(θ | θ^(p)).


EM for the Mixture of Normals

• Observed data (incomplete): x_1, x_2, ..., x_n, where n is the sample size. Denote all the samples collectively by x.

• Complete data: (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where y_i is the cluster (component) identity of sample x_i.

• The collection of parameters, θ, includes: a_k, μ_k, Σ_k, k = 1, 2, ..., K.

• The likelihood function is:

$$L(x \mid \theta) = \sum_{i=1}^{n} \log\left( \sum_{k=1}^{K} a_k \phi(x_i \mid \mu_k, \Sigma_k) \right).$$

• L(x | θ) is the objective function of the EM algorithm (maximize). Numerical difficulty comes from the sum inside the log.
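A related practical point: the sum inside the log can underflow when the densities are tiny. A common remedy (not discussed in the slides) is to evaluate the log-likelihood with the log-sum-exp trick; a sketch:

```python
# Numerically stable evaluation of sum_i log sum_k a_k phi(x_i | mu_k, Sigma_k).
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(X, a, mu, Sigma):
    # log_phi[i, k] = log a_k + log phi(x_i | mu_k, Sigma_k)
    log_phi = np.column_stack([
        np.log(a_k) + multivariate_normal.logpdf(X, mean=m, cov=S)
        for a_k, m, S in zip(a, mu, Sigma)
    ])
    return logsumexp(log_phi, axis=1).sum()
```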


• The Q function is:

$$Q(\theta' \mid \theta) = E\left[ \log \prod_{i=1}^{n} a'_{y_i} \phi(x_i \mid \mu'_{y_i}, \Sigma'_{y_i}) \,\Big|\, x, \theta \right]
= E\left[ \sum_{i=1}^{n} \left( \log a'_{y_i} + \log \phi(x_i \mid \mu'_{y_i}, \Sigma'_{y_i}) \right) \Big|\, x, \theta \right]
= \sum_{i=1}^{n} E\left[ \log a'_{y_i} + \log \phi(x_i \mid \mu'_{y_i}, \Sigma'_{y_i}) \mid x_i, \theta \right].$$

The last equality comes from the fact that the samples are independent.

• Note that when x_i is given, only y_i is random in the complete data (x_i, y_i). Also y_i only takes a finite number of values, i.e., cluster identities 1 to K. The distribution of Y given X = x_i is the posterior probability of Y given X.

• Denote the posterior probability of Y = k, k = 1, ..., K, given x_i by p_{i,k}. By the Bayes formula, the posterior probabilities are:

$$p_{i,k} \propto a_k \phi(x_i \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} p_{i,k} = 1 .$$

• Then each summand in Q(θ' | θ) is

$$E\left[ \log a'_{y_i} + \log \phi(x_i \mid \mu'_{y_i}, \Sigma'_{y_i}) \mid x_i, \theta \right]
= \sum_{k=1}^{K} p_{i,k} \log a'_k + \sum_{k=1}^{K} p_{i,k} \log \phi(x_i \mid \mu'_k, \Sigma'_k) .$$

• Note that we cannot see the direct effect of θ in the above equation, but the p_{i,k} are computed using θ, i.e., the current parameters. θ' includes the updated parameters.

• We then have:

$$Q(\theta' \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} p_{i,k} \log a'_k + \sum_{i=1}^{n} \sum_{k=1}^{K} p_{i,k} \log \phi(x_i \mid \mu'_k, \Sigma'_k) .$$

• Note that the prior probabilities a'_k and the parameters of the Gaussian components μ'_k, Σ'_k can be optimized separately.

• The a'_k are subject to \sum_{k=1}^{K} a'_k = 1. Basic optimization theory shows that the a'_k are optimized by

$$a'_k = \frac{\sum_{i=1}^{n} p_{i,k}}{n} .$$

• The optimization of μ_k and Σ_k is simply a maximum likelihood estimation of the parameters using samples x_i with weights p_{i,k}. Basic optimization techniques also lead to

$$\mu'_k = \frac{\sum_{i=1}^{n} p_{i,k}\, x_i}{\sum_{i=1}^{n} p_{i,k}}, \qquad
\Sigma'_k = \frac{\sum_{i=1}^{n} p_{i,k}\, (x_i - \mu'_k)(x_i - \mu'_k)^t}{\sum_{i=1}^{n} p_{i,k}} .$$

• After every iteration, the likelihood function L is guaranteed to increase (though possibly not strictly).

• We have derived the EM algorithm for a mixture of Gaussians.

EM Algorithm for the Mixture of Gaussians

Parameters estimated at the pth iteration are marked by a superscript (p).

1. Initialize parameters.

2. E-step: Compute the posterior probabilities for all i = 1, ..., n, k = 1, ..., K:

$$p_{i,k} = \frac{a_k^{(p)} \phi(x_i \mid \mu_k^{(p)}, \Sigma_k^{(p)})}{\sum_{l=1}^{K} a_l^{(p)} \phi(x_i \mid \mu_l^{(p)}, \Sigma_l^{(p)})} .$$

3. M-step:

$$a_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k}}{n}$$

$$\mu_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k}\, x_i}{\sum_{i=1}^{n} p_{i,k}}$$

$$\Sigma_k^{(p+1)} = \frac{\sum_{i=1}^{n} p_{i,k}\, (x_i - \mu_k^{(p+1)})(x_i - \mu_k^{(p+1)})^t}{\sum_{i=1}^{n} p_{i,k}}$$

4. Repeat steps 2 and 3 until convergence.
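A compact NumPy sketch of this E-step/M-step loop (the initialization, the small regularization added to the covariances, and the stopping rule are simple choices of ours, not prescribed by the slides):

```python
# EM for a Gaussian mixture: alternate posterior computation and parameter updates.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=200, tol=1e-6, rng=None):
    rng = np.random.default_rng(rng)
    n, d = X.shape
    a = np.full(K, 1.0 / K)                                  # equal initial weights
    mu = X[rng.choice(n, size=K, replace=False)].copy()      # random data points as means
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probabilities p[i, k].
        dens = np.column_stack([
            a[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
            for k in range(K)
        ])
        ll = np.log(dens.sum(axis=1)).sum()
        p = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, covariances.
        Nk = p.sum(axis=0)
        a = Nk / n
        mu = (p.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (p[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        if ll - prev_ll < tol * abs(prev_ll):
            break
        prev_ll = ll
    return a, mu, Sigma, p
```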

Comment: for mixtures of other distributions, the EM algorithm is very similar. The E-step involves computing the posterior probabilities; only the particular distribution φ needs to be changed. The M-step always involves parameter optimization; the formulas differ according to the distribution.


Computation Issues

• If a different Σ_k is allowed for each component, the likelihood function is not bounded. The global optimum is meaningless. (Don’t overdo it!)

• How to initialize? Example:

– Apply k-means first.

– Initialize μ_k and Σ_k using all the samples classified to cluster k.

– Initialize a_k by the proportion of data assigned to cluster k by k-means.

• In practice, we may want to reduce model complexity by putting constraints on the parameters. For instance, assume equal priors and identical covariance matrices for all the components.
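Both ideas, k-means initialization and constrained covariances, are available in scikit-learn's GaussianMixture; a sketch (the toy data are made up):

```python
# k-means-based initialization and a shared ("tied") covariance constraint.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])  # toy data

gmm = GaussianMixture(n_components=2,
                      init_params="kmeans",     # initialize from a k-means run
                      covariance_type="tied")   # one covariance matrix shared by all components
labels = gmm.fit_predict(X)
```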


Examples

• The heart disease data set is taken from the UCI machine learning database repository.

• There are 297 cases (samples) in the data set, of which 137 have heart disease. Each sample contains 13 quantitative variables, including cholesterol, max heart rate, etc.

• We remove the mean of each variable and normalize it to yield unit variance.

• The data are projected onto the plane spanned by the two most dominant principal component directions.

• A two-component Gaussian mixture is fit.
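A sketch of this pipeline; X below is a random placeholder for the 297 × 13 UCI heart disease matrix, used only to make the snippet runnable:

```python
# Standardize, project onto two principal components, fit a 2-component mixture.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(3).normal(size=(297, 13))    # placeholder data

X_std = StandardScaler().fit_transform(X)              # zero mean, unit variance
X_2d = PCA(n_components=2).fit_transform(X_std)        # two dominant principal components
gmm = GaussianMixture(n_components=2).fit(X_2d)
labels = gmm.predict(X_2d)
```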


Figure 6: The heart disease data set and the estimated cluster densities. Top: the scatter plot of the data. Bottom: the contour plot of the pdf estimated using a single-layer mixture of two normals. The thick lines are the boundaries between the two clusters based on the estimated pdfs of the individual clusters.
