DATA MINING: van data naar informatie (from data to information). Ronald Westra, Dep. Mathematics, Maastricht University


Page 1:

DATA MINING: van data naar informatie (from data to information)

Ronald Westra, Dep. Mathematics, Maastricht University

Page 2:

CLUSTERING AND CLUSTER ANALYSIS

Data Mining Lecture IV [Chapter 8: section 8.4 and Chapter 9 from Principles of Data Mining by Hand, Mannila, Smyth]

Page 3:

1. Clustering versus Classification

• classification: give a pre-determined label to a sample

• clustering: provide the relevant labels for classification from structure in a given dataset

• clustering: maximal intra-cluster similarity and maximal inter-cluster dissimilarity

• Objectives:
  1. segmentation of space
  2. find natural subclasses

Pages 4-6: Examples of Clustering and Classification, 1. Computer Vision (figure slides)

Pages 7-8: Examples of Clustering and Classification, 2. Types of chemical reactions (figure slides)

Page 9:

Voronoi Clustering

Georgy Fedoseevich Voronoy

1868 - 1908

Page 10:

Voronoi Clustering

A Voronoi diagram (also called a Voronoi tessellation, Voronoi decomposition, or Dirichlet tessellation) is a special kind of decomposition of a metric space determined by distances to a specified discrete set of objects in the space, e.g., a discrete set of points.
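The definition above can be made concrete with a small numpy sketch (not from the slides): assign arbitrary points to the Voronoi cell of their nearest site.

```python
import numpy as np

def voronoi_labels(points, sites):
    """Index of the nearest site for each point (Euclidean metric)."""
    # pairwise distances, shape (n_points, n_sites)
    d = np.linalg.norm(points[:, None, :] - sites[None, :, :], axis=2)
    return d.argmin(axis=1)

sites = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
pts = np.array([[0.1, 0.1], [0.9, 0.2], [0.2, 0.8]])
print(voronoi_labels(pts, sites))  # [0 1 2]
```

Drawing the actual cell boundaries needs dedicated routines, but for clustering purposes nearest-site assignment is all that is used.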

Pages 11-17: Voronoi Clustering (figure slides)

Page 18:

Partitional Clustering [book section 9.4]

• score functions
• centroid
• intra-cluster distance
• inter-cluster distance
• C-means [book page 303]

Page 19:

k-means clustering (also: C-means)

The k-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster, i.e., its coordinates are the arithmetic mean, computed for each dimension separately over all the points in the cluster.

Page 20:

k-means clustering (also: C-means)

Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3).

Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2
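The worked example translates directly into code; a minimal numpy version:

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])   # X = (x1, x2, x3)
Y = np.array([5.0, 4.0, 3.0])   # Y = (y1, y2, y3)

Z = (X + Y) / 2                 # arithmetic mean per dimension
print(Z)                        # [3. 3. 3.]
```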

Page 21:

k-means clustering (also: C-means)

This is the basic structure of the algorithm (J. MacQueen, 1967):

• Randomly generate k clusters and determine the cluster centers, or directly generate k seed points as cluster centers.

• Assign each point to the nearest cluster center.

• Recompute the new cluster centers.

• Repeat until some convergence criterion is met (usually that the assignments haven't changed).

Page 22:

C-means [book page 303]

while changes in clusters Ck
    % form clusters
    for k = 1,…,K do
        Ck = {x | ||x – rk|| ≤ ||x – rl|| for all l ≠ k}
    end
    % compute new cluster centroids
    for k = 1,…,K do
        rk = mean({x | x ∈ Ck})
    end
end
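A minimal, self-contained Python translation of this loop (the deterministic seeding with the first K points is an illustration choice; random seeding is more common):

```python
import numpy as np

def c_means(X, K, n_iter=100):
    """Hard C-means / k-means, mirroring the pseudocode above."""
    r = X[:K].copy()                      # seed centroids with the first K points
    for _ in range(n_iter):
        # form clusters: each x joins the nearest centroid rk
        labels = np.linalg.norm(X[:, None, :] - r[None, :, :], axis=2).argmin(axis=1)
        # compute new cluster centroids
        new_r = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else r[k]
                          for k in range(K)])
        if np.allclose(new_r, r):         # converged: centroids stopped moving
            break
        r = new_r
    return labels, r

# two well-separated groups; the ordering puts one point of each group first
X = np.array([[0.0, 0.0], [10.0, 10.0],
              [0.5, 0.0], [0.0, 0.5],
              [9.5, 10.0], [10.0, 9.5]])
labels, centroids = c_means(X, K=2)
print(labels)  # [0 1 0 0 1 1]
```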

Page 23:

k-means clustering (also: C-means)

The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. However, it does not systematically yield the same result with each run: the resulting clusters depend on the initial assignments. The k-means algorithm minimizes intra-cluster (equivalently, maximizes inter-cluster) variance, but does not ensure that the solution found is more than a local minimum of the variance.

Pages 24-27: k-means clustering (also: C-means) (figure slides)

Page 28:

Fuzzy c-means

One of the problems of the k-means algorithm is that it gives a hard partitioning of the data: each point is attributed to one and only one cluster. But points on the edge of a cluster, or near another cluster, may not belong to it as strongly as points in the center of the cluster.

Page 29:

Fuzzy c-means

Therefore, in fuzzy clustering, each point does not belong exclusively to one cluster, but has a degree of belonging to each cluster, as in fuzzy logic. For each point x we have a coefficient uk(x) giving its degree of membership in the k-th cluster. Usually, the sum of those coefficients is required to be one, so that uk(x) can be interpreted as a probability of belonging to a certain cluster: Σk uk(x) = 1.

Page 30:

Fuzzy c-means

With fuzzy c-means, the centroid of a cluster is computed as the mean of all points, weighted by their degree of belonging to the cluster, that is:

ck = Σx uk(x)^m x / Σx uk(x)^m

Page 31:

Fuzzy c-means

The degree of being in a certain cluster is related to the inverse of the distance to the cluster center; the coefficients are then normalized and fuzzified with a real parameter m > 1 so that their sum is 1:

uk(x) = 1 / Σj ( d(x, ck) / d(x, cj) )^(2/(m–1))

Page 32:

Fuzzy c-means

For m equal to 2, this is equivalent to normalizing the coefficients linearly to make their sum 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm becomes similar to k-means.

Page 33:

Fuzzy c-means

The fuzzy c-means algorithm is very similar to the k-means algorithm:

Page 34:

Fuzzy c-means

• Choose a number of clusters.
• Assign randomly to each point coefficients for being in the clusters.
• Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
  • Compute the centroid for each cluster, using the formula above.
  • For each point, compute its coefficients of being in the clusters, using the formula above.

Page 35:

Fuzzy C-means

ujk is the membership of sample j in cluster k
ck is the centroid of cluster k

while changes in clusters Ck
    % compute new memberships
    for k = 1,…,K do
        for j = 1,…,N do
            ujk = f(xj – ck)
        end
    end
    % compute new cluster centroids (weighted means)
    for k = 1,…,K do
        ck = Σj ujk xj / Σj ujk
    end
end
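The same loop written out as a small numpy sketch; the initialization and the membership update follow the standard fuzzy c-means formulas, so this is an illustrative implementation, not the book's code:

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, n_iter=100, eps=1e-5, seed=0):
    """Fuzzy c-means: memberships u[j, k] in [0, 1], each row summing to 1."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)            # random initial memberships
    for _ in range(n_iter):
        w = u ** m                               # fuzzified weights
        centroids = (w.T @ X) / w.sum(axis=0)[:, None]   # weighted means
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                    # avoid division by zero
        # u[j, k] = 1 / sum_l (d[j, k] / d[j, l])^(2/(m-1))
        new_u = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1))).sum(axis=2)
        if np.abs(new_u - u).max() < eps:        # coefficients changed less than eps
            break
        u = new_u
    return u, centroids

X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [10.0, 10.0], [9.5, 10.0], [10.0, 9.5]])
u, cents = fuzzy_cmeans(X, c=2)
```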

Page 36:

Fuzzy c-means

The fuzzy c-means algorithm also minimizes intra-cluster variance, but it has the same problems as k-means: the minimum found is a local minimum, and the results depend on the initial choice of weights.

Pages 37-39: Fuzzy c-means (figure slides)

Page 40:

Fuzzy c-means

[Figures: "Trajectory of Fuzzy MultiVariate Centroids" and "Trajectory of Fuzzy C-means Centroids", each tracking centroids 1-5 on axes x in 0 to 1 and y in -0.1 to 0.9.]

Page 41:

The Correct Number of Clusters

Algorithms like C-means and fuzzy C-means need the “correct” number K of clusters in your data set.

In realistic cases it is mostly impossible to define what this number K should be.

Therefore, the following approach is often used.

Page 42:

The Correct Number of Clusters

Sum all distances between points and their respective centroids:

E = Σk Σ{xi ∈ Ck} ||xi – ck||

Page 43:

The Correct Number of Clusters

Now plot this error E as a function of the number of clusters K.

[Figure: E versus K; the curve drops steeply and then levels off at a "shoulder".]

Page 44:

The Correct Number of Clusters

Note that the error levels off (the "shoulder") when K reaches the natural number of clusters in your data set.

Now, how to define the error of your clustering? A solution is to sum all distances between points and their respective centroids.
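The whole procedure (run the clustering for several values of K, compute E, look for the shoulder) can be sketched as follows; the simple k-means and the evenly spread seeding here are illustration choices:

```python
import numpy as np

def kmeans(X, K, n_iter=50):
    # evenly spread seeding over the (ordered) data, an illustration choice
    centroids = X[np.linspace(0, len(X) - 1, K).astype(int)].copy()
    for _ in range(n_iter):
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        centroids = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                              else centroids[k] for k in range(K)])
    return labels, centroids

def cluster_error(X, centroids, labels):
    """E = sum over clusters of ||xi - ck|| for the points xi in cluster k."""
    return sum(np.linalg.norm(X[labels == k] - c, axis=1).sum()
               for k, c in enumerate(centroids))

# three natural clusters: the E(K) curve should show a shoulder at K = 3
X = np.vstack([np.random.default_rng(1).normal(c, 0.1, (20, 2))
               for c in [(0, 0), (5, 5), (10, 0)]])
E = []
for K in range(1, 7):
    labels, cents = kmeans(X, K)
    E.append(cluster_error(X, cents, labels))
```

For this data E drops sharply up to K = 3 and then levels off, which is exactly the shoulder used to pick K.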

Page 45:

Hierarchical Clustering [book section 9.5]

One major problem with partitional clustering is that the number of clusters (= #classes) must be pre-specified !!!

This poses the question: what IS the real number of clusters in a given set of data?

Answer: it depends!

• Agglomerative methods: bottom-up

• Divisive methods: top-down

Pages 46-50: Hierarchical Clustering, agglomerative hierarchical clustering (figure slides)

Page 51:

Introduction to Bioinformatics: 7.3 Inferring Trees

* n taxa {t1,…,tn}

* D: matrix of pairwise genetic distances, with JC-correction

* Additive distances: the distance over the path from i → j is d(i,j)

* (Total) length of a tree: the sum of all branch lengths.

Page 52:

Introduction to Bioinformatics: 7.3 Inferring Trees

Finding branch lengths with the three-point formula:

Lx + Ly = dAB
Lx + Lz = dAC
Ly + Lz = dBC

which solves to:

Lx = (dAB + dAC – dBC)/2
Ly = (dAB + dBC – dAC)/2
Lz = (dAC + dBC – dAB)/2

[Figure: leaves A, B and C joined to a centre node by branches of lengths Lx, Ly and Lz.]
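The three-point formula is one line per branch; a small sketch with a worked example (the distances are made up for illustration):

```python
def branch_lengths(dAB, dAC, dBC):
    """Three-point formula: solves Lx + Ly = dAB, Lx + Lz = dAC, Ly + Lz = dBC."""
    Lx = (dAB + dAC - dBC) / 2
    Ly = (dAB + dBC - dAC) / 2
    Lz = (dAC + dBC - dAB) / 2
    return Lx, Ly, Lz

# worked example: dAB = 5, dAC = 7, dBC = 8
print(branch_lengths(5.0, 7.0, 8.0))  # (2.0, 3.0, 5.0)
```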

Page 53:

Introduction to Bioinformatics: 7.3 Inferring Trees

Four-point condition: d(1,2) + d(i,j) ≤ d(1,i) + d(2,j) when (1,2) and (i,j) are neighbor couples!

Define:

Ri = Σj d(ti, tj)

M(i,j) = (n – 2) d(i,j) – Ri – Rj

If i and j are neighbours, then M(i,j) < M(i,k) for all k not equal to j: minimizing M minimizes d(i,j) AND the total distance in the tree.

[Figure: leaves 1, 2, i and j joined through a centre, with branch lengths Lx, Ly, Lz and Lq.]

Page 54:

NJ algorithm:

Input: n×n distance matrix D and an outgroup
Output: rooted phylogenetic tree T

Step 1: Compute the new table M from D; select the smallest value of M to choose the two taxa to join.

Step 2: Join the two taxa ti and tj into a new vertex V; use the three-point formula to calculate the updated distance matrix D’ in which ti and tj are replaced by V.

Step 3: Compute the branch lengths from ti and tj to V using the three-point formula; set T(V,1) = ti and T(V,2) = tj, with TD(ti) = L(ti,V) and TD(tj) = L(tj,V).

Step 4: The distance matrix D’ now contains n – 1 taxa. If more than 2 taxa are left, go to step 1. If two taxa are left, join them by a branch of length d(ti,tj).

Step 5: Define the root node as the branch connecting the outgroup to the rest of the tree. (Alternatively, determine the so-called “mid-point”.)
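Step 1 can be sketched in a few lines of numpy; the 4-taxon distance matrix below is an illustrative additive example, not from the slides:

```python
import numpy as np

def nj_pair(D):
    """One neighbor-joining selection step: build M from D and return the
    pair (i, j) minimizing M(i,j) = (n-2) d(i,j) - Ri - Rj."""
    n = len(D)
    R = D.sum(axis=1)                    # Ri = sum over j of d(ti, tj)
    M = (n - 2) * D - R[:, None] - R[None, :]
    np.fill_diagonal(M, np.inf)          # a taxon is never joined with itself
    i, j = divmod(M.argmin(), n)
    return int(i), int(j)

# additive distances on a 4-taxon tree in which (0,1) and (2,3) are neighbors
D = np.array([[0, 3, 7, 8],
              [3, 0, 6, 7],
              [7, 6, 0, 3],
              [8, 7, 3, 0]], float)
print(nj_pair(D))  # (0, 1)
```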

Page 55:

Introduction to Bioinformatics: 7.3 Inferring Trees

UPGMA and ultrametric trees:

If the distance from the root to all leaves is equal, the tree is ultrametric.

In that case we can use D instead of M, and the algorithm is called UPGMA (Unweighted Pair Group Method with Arithmetic mean).

Ultrametricity must hold for the real tree, but due to noise this condition will in practice generate erroneous trees.
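Ultrametricity is easy to test: a standard characterization is that for every triple of taxa, the two largest pairwise distances are equal. A small sketch (the example matrices are illustrative):

```python
from itertools import combinations
import numpy as np

def is_ultrametric(D, tol=1e-9):
    """True iff for every triple of taxa the two largest distances are equal."""
    n = len(D)
    for i, j, k in combinations(range(n), 3):
        a, b, c = sorted([D[i, j], D[i, k], D[j, k]])
        if abs(c - b) > tol:             # the two largest must coincide
            return False
    return True

# taxa 0 and 1 are closer kin; both sit at distance 6 from taxon 2
D_ok = np.array([[0, 2, 6],
                 [2, 0, 6],
                 [6, 6, 0]], float)
# violates the condition: the two largest distances (6 and 5) differ
D_bad = np.array([[0, 2, 6],
                  [2, 0, 5],
                  [6, 5, 0]], float)
print(is_ultrametric(D_ok), is_ultrametric(D_bad))  # True False
```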

Pages 56-57: recap slides, repeating "Example of Clustering and Classification" and "Clustering versus Classification" from page 3.