Prof. M-T. Kechadi, School of Computer Science & Informatics, UCD
Learning Outcomes
Understand Clustering Analysis
Types of Data in Clustering Analysis
Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Clustering Analysis

Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters
Clustering analysis: grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined classes
Typical applications:
as a stand-alone tool to get insight into data distribution
as a pre-processing step for other algorithms
Clustering Applications

Pattern recognition
Spatial data analysis:
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data mining
Image processing
Economic science
WWW:
document classification
cluster Weblog data to discover groups of similar access patterns
Types of data in clustering analysis
Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Interval-valued variables
Standardise data
Calculate the mean absolute deviation:

$s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|)$

where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \cdots + x_{nf})$

Calculate the standardised measurement (z-score):

$z_{if} = \frac{x_{if} - m_f}{s_f}$

Using the mean absolute deviation is more robust than using the standard deviation
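As a concrete illustration of this standardisation, here is a minimal sketch in Python (the function and sample values are my own, not from the lecture):

```python
def standardise(values):
    """Standardise one interval-scaled variable using the
    mean absolute deviation rather than the standard deviation."""
    n = len(values)
    m_f = sum(values) / n                          # mean
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]       # z-scores

print(standardise([2.0, 4.0, 6.0, 8.0]))  # [-1.5, -0.5, 0.5, 1.5]
```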
Similarity & Dissimilarity Between Objects
Distance is normally used to measure the similarity or dissimilarity between two data objects

Some popular ones include:

Minkowski distance:

$d(i,j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q}$

where $i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

Manhattan distance (q = 1):

$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
Similarity & Dissimilarity Between Objects
Euclidean distance (q = 2):

$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$

Properties:

$d(i,j) \geq 0$
$d(i,i) = 0$
$d(i,j) = d(j,i)$
$d(i,j) \leq d(i,k) + d(k,j)$

Also one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
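A minimal sketch of these three distance functions in Python (names are illustrative, not from the lecture):

```python
def minkowski(x, y, q):
    """Minkowski distance between two p-dimensional points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):
    return minkowski(x, y, 1)   # q = 1

def euclidean(x, y):
    return minkowski(x, y, 2)   # q = 2

p1, p2 = (1.0, 2.0), (4.0, 6.0)
print(manhattan(p1, p2))   # 7.0
print(euclidean(p1, p2))   # 5.0
```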
Binary Variables

A contingency table for binary data:

                Object j
                 1     0    sum
Object i   1     a     b    a+b
           0     c     d    c+d
         sum    a+c   b+d    p

Simple matching coefficient (invariant, if the binary variable is symmetric):

$d(i,j) = \frac{b + c}{a + b + c + d}$

Jaccard coefficient (non-invariant if the binary variable is asymmetric):

$d(i,j) = \frac{b + c}{a + b + c}$
Dissimilarity between Binary Variables

Example:
gender is a symmetric attribute
the remaining attributes are asymmetric binary
let the values Y and P be set to 1, and the value N be set to 0
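The patient table from this slide did not survive extraction, so the records below are hypothetical stand-ins; the sketch shows how the Jaccard dissimilarity over the asymmetric attributes would be computed:

```python
# Hypothetical records: asymmetric binary attributes already coded 1 (Y/P) / 0 (N).
patient_1 = [1, 0, 1, 0, 0, 0]
patient_2 = [1, 0, 1, 0, 1, 0]

def jaccard_dissimilarity(i, j):
    """d(i,j) = (b + c) / (a + b + c); the d (0-0) matches are ignored."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

print(jaccard_dissimilarity(patient_1, patient_2))  # (0 + 1) / (2 + 0 + 1) = 0.33...
```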
Nominal Variables

Categorical variables can take more than 2 states, e.g., red, yellow, blue, green

Method 1: simple matching

$d(i,j) = \frac{p - m}{p}$

where m is the number of matches and p is the total number of variables

Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states
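A one-function sketch of the simple matching dissimilarity (function name and sample values are illustrative):

```python
def simple_matching(i, j):
    """Nominal dissimilarity d(i,j) = (p - m) / p."""
    p = len(i)                                   # total number of variables
    m = sum(1 for x, y in zip(i, j) if x == y)   # number of matches
    return (p - m) / p

print(simple_matching(["red", "small", "round"],
                      ["red", "large", "round"]))  # (3 - 2) / 3 = 0.33...
```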
Ordinal Variables

An ordinal variable can be discrete or continuous
The order is important, e.g., rank
Can be treated like interval-scaled variables:
replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$
map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by

$z_{if} = \frac{r_{if} - 1}{M_f - 1}$

compute the dissimilarity using methods for interval-scaled variables
Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, e.g., $Ae^{Bt}$ or $Ae^{-Bt}$

Methods:
treat them like interval-scaled variables (not a good choice!)
apply a logarithmic transformation: $y_{if} = \log(x_{if})$
treat them as continuous ordinal data and treat their ranks as interval-scaled
Variables of Mixed Types

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects:

$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, otherwise $d_{ij}^{(f)} = 1$
f is interval-based: use the normalised distance
f is ordinal or ratio-scaled: compute the ranks $r_{if}$, set $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled
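A compact sketch of the weighted combination formula (names illustrative; the per-variable dissimilarities are assumed to be precomputed as above):

```python
def mixed_dissimilarity(per_variable_d, deltas):
    """d(i,j) = sum(delta_f * d_f) / sum(delta_f) over the p variables.
    per_variable_d[f] is the dissimilarity contributed by variable f;
    deltas[f] is 0 when variable f should be ignored (e.g. a missing
    value, or an asymmetric binary 0-0 match), else 1."""
    num = sum(d * w for d, w in zip(per_variable_d, deltas))
    den = sum(deltas)
    return num / den

# Three variables: a nominal mismatch (1.0), a normalised interval
# distance (0.25), an ordinal z-score difference (0.5); all counted.
print(mixed_dissimilarity([1.0, 0.25, 0.5], [1, 1, 1]))  # 0.5833...
```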
Major Clustering Techniques

Partitioning algorithms: construct various partitions and then evaluate them by some criterion
Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: a model is hypothesised for each of the clusters, and the idea is to find the best fit of the data to the given model
Partitioning Algorithms: Basic Concept

Partitioning method: partition a database D of n objects into a set of k clusters

Problem: given k, find a partition of k clusters that optimises the chosen partitioning criterion

Remarks:
Global optimum: exhaustively enumerate all partitions (!!!)
Heuristic methods: k-means and k-medoids algorithms
k-means: each cluster is represented by the centre of the cluster
k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster
The K-Means Clustering Method
Example
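The worked example on this slide was a figure that did not survive extraction; as a stand-in, here is a minimal k-means sketch in Python (an illustration under my own naming, not the lecturer's code):

```python
import random

def kmeans(points, k, t=100):
    """Basic k-means: assign each point to the nearest centre,
    recompute each centre as the mean of its cluster, repeat."""
    centres = random.sample(points, k)          # arbitrary initial centres
    for _ in range(t):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assignment step
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centres[c])))
            clusters[nearest].append(p)
        new_centres = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c
                       else centres[i]          # keep an empty cluster's centre
                       for i, c in enumerate(clusters)]
        if new_centres == centres:              # converged: a local optimum
            break
        centres = new_centres
    return centres, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (8.0, 8.0), (9.0, 9.5)]
centres, clusters = kmeans(points, k=2)
print(centres)
```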
K-Means Method

Strengths:
Relatively efficient: O(tkn), where n, k, and t are the number of objects, number of clusters, and number of iterations respectively. Normally, k, t << n
Often terminates at a local optimum; the global optimum may be found using techniques such as genetic algorithms

Weaknesses:
Applicable only when the mean is defined; what about categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
Not suitable to discover clusters with non-convex shapes
The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids):
Start from an initial set of medoids and iteratively replace one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
PAM works effectively for small data sets, but does not scale well for large data sets
Focusing + spatial data structure
PAM (Partitioning Around Medoids)

Use real objects to represent the clusters:
1. Select k representative objects arbitrarily
2. For each pair of non-selected object h and selected object i, calculate the total swapping cost TC_ih
3. For each pair of i and h: if TC_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object
4. Repeat steps 2-3 until there is no change
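A minimal sketch of the PAM swap loop (helper names are my own; Euclidean distance is used for concreteness):

```python
import math

def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = points[:k]                     # arbitrary initial medoids
    improved = True
    while improved:                          # repeat until no change
        improved = False
        for i in list(medoids):
            for h in points:
                if h in medoids:
                    continue
                candidate = [h if m == i else m for m in medoids]
                # TC_ih < 0 means the swap reduces the total distance
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids, improved = candidate, True
    return medoids

pts = [(1.0, 1.0), (1.2, 1.1), (8.0, 8.0), (8.2, 7.9), (4.5, 4.5)]
print(pam(pts, k=2))
```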
PAM Clustering: Total Swapping Cost

$TC_{ih} = \sum_{j} C_{jih}$

(Figure: four cases illustrating $C_{jih}$, the change in cost for a non-selected object j when medoid i is swapped with non-medoid h; depending on the case, j stays with another medoid t or is reassigned between i, h and t.)
Hierarchical Clustering

Use a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

(Figure: reading steps 0 to 4 left to right, objects a, b, c, d, e are merged progressively into {a,b}, {d,e}, {c,d,e} and finally {a,b,c,d,e}: agglomerative clustering (AGNES). Reading the same steps right to left splits the single cluster back down to individual objects: divisive clustering (DIANA).)
AGNES (Agglomerative Nesting)

Uses the single-link method and the dissimilarity matrix
Merges nodes that have the least dissimilarity
Continues in a non-descending fashion
Eventually all nodes belong to the same cluster
A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
DIANA (Divisive Analysis)

Inverse order of AGNES
Eventually each node forms a cluster on its own
Hierarchical Clustering Methods

Major weaknesses of agglomerative clustering methods:
they do not scale well: time complexity of at least O(n^2), where n is the total number of objects
they can never undo what was done previously

Integration of hierarchical with distance-based clustering:
BIRCH: uses a CF-tree and incrementally adjusts the quality of sub-clusters
CURE: selects well-scattered points from the cluster and then shrinks them towards the centre of the cluster by a specified fraction
Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)

N: number of data points
LS: $\sum_{i=1}^{N} X_i$, the linear sum of the points in the cluster
SS: $\sum_{i=1}^{N} X_i^2$, the square sum of the points in the cluster

Example: the points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
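A quick sketch that computes a CF vector for a list of 2-D points and reproduces the slide's example (the function name is my own):

```python
def clustering_feature(points):
    """CF = (N, LS, SS): count, per-dimension linear sum, per-dimension square sum."""
    n = len(points)
    ls = tuple(sum(xs) for xs in zip(*points))                 # linear sum
    ss = tuple(sum(x * x for x in xs) for xs in zip(*points))  # square sum
    return n, ls, ss

print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# (5, (16, 30), (54, 190))
```

Since CF vectors add component-wise, two sub-clusters can be merged by summing their CFs, which is what lets BIRCH maintain them incrementally.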
CF Tree

(Figure: a CF tree with branching factor B = 6 and leaf capacity L = 7. The root holds CF entries CF1 ... CF6, each pointing to a child; non-leaf nodes hold CF entries summarising their sub-clusters; leaf nodes hold CF entries for data sub-clusters and are chained together by prev/next pointers.)
CURE (Clustering Using Representatives)

The main idea of CURE:
stops the creation of a cluster hierarchy if a level consists of k clusters
uses multiple representative points to evaluate the distance between clusters; adjusts well to arbitrarily shaped clusters and avoids the single-link effect

Drawbacks of square-error-based clustering methods:
consider only one point as representative of a cluster
good only for clusters that are convex-shaped and of similar size and density, and only if k can be reasonably estimated
CURE: The Algorithm

1. Draw a random sample s
2. Partition the sample into p partitions, each of size s/p
3. Partially cluster each partition into s/pq clusters
4. Eliminate outliers: by random sampling; if a cluster grows too slowly, eliminate it
5. Cluster the partial clusters
6. Label the data on disk
Data Partitioning and Clustering

(Figure: example with s = 50 and p = 2, so each partition holds s/p = 25 sample points and is partially clustered into s/pq = 5 clusters.)
CURE: Shrinking Representative Points

Shrink the multiple representative points towards the gravity centre by a fraction α
Multiple representatives capture the shape of the cluster
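A minimal sketch of the shrinking step, assuming a shrink fraction alpha in [0, 1] (names and sample values are illustrative):

```python
def shrink_representatives(reps, centre, alpha):
    """Move each representative point a fraction alpha of the way
    towards the cluster's gravity centre."""
    return [tuple(r + alpha * (c - r) for r, c in zip(rep, centre))
            for rep in reps]

reps = [(0.0, 0.0), (4.0, 0.0), (2.0, 4.0)]
print(shrink_representatives(reps, centre=(2.0, 2.0), alpha=0.5))
# [(1.0, 1.0), (3.0, 1.0), (2.0, 3.0)]
```

Pulling the representatives inwards damps the influence of outliers at a cluster's fringe while the multiple points still trace its shape.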
Clustering Categorical Data: ROCK

ROCK: Robust Clustering using linKs
Uses links to measure similarity/proximity; not distance-based
Computational complexity: $O(n^2 + n m_m m_a + n^2 \log n)$, where $m_a$ and $m_m$ are the average and maximum number of neighbours

Basic ideas:

Similarity function and neighbours:

$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$

Example: let $T_1 = \{1,2,3\}$ and $T_2 = \{3,4,5\}$; then

$Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2$
ROCK Algorithm

Links: the number of common neighbours between two points

Algorithm:
1. Draw a random sample of the data
2. Cluster the sample using the link-based agglomerative approach
3. Label the data on disk

Example: consider the 3-subsets of {1,...,5} as points:
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}
Then link({1,2,3}, {1,2,4}) = 3, since (with a suitable similarity threshold) the two sets have 3 common neighbours
Example

Consider a system where documents may contain keywords from {book, water, sun, sand, swim, read}. Suppose there are 4 documents, where:
the 1st contains {book},
the 2nd contains {water, sun, sand, swim},
the 3rd contains {water, sun, swim, read}, and
the 4th contains {read, sand}.
Use the ROCK clustering algorithm to cluster these documents.
Density-Based Clustering Methods

Clustering based on density (a local cluster criterion), such as density-connected points

Major features:
discover clusters of arbitrary shape
handle noise
one scan
need density parameters as a termination condition
Density-Based Clustering: Background

Two parameters:
Eps: maximum radius of the neighbourhood
MinPts: minimum number of points in an Eps-neighbourhood of that point

$N_{Eps}(p) = \{q \in D \mid dist(p, q) \le Eps\}$

Directly density-reachable: a point p is directly density-reachable from a point q with regard to Eps, MinPts if
1) p belongs to $N_{Eps}(q)$
2) core point condition: $|N_{Eps}(q)| \ge MinPts$

(Figure: p lies in the Eps-neighbourhood of core point q, with MinPts = 5 and Eps = 1 cm.)
Density-Based Clustering: Background (II)

Density-reachable: a point p is density-reachable from a point q with regard to Eps, MinPts if there is a chain of points $p_1, \ldots, p_n$, with $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is directly density-reachable from $p_i$

Density-connected: a point p is density-connected to a point q with regard to Eps, MinPts if there is a point o such that both p and q are density-reachable from o with regard to Eps and MinPts

(Figures: a chain of points from q through p1 to p for density-reachability; points p and q both reachable from o for density-connectivity.)
DBSCAN: Density-Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise

(Figure: core, border and outlier points for Eps = 1 cm and MinPts = 5.)
DBSCAN: The Algorithm

1. Arbitrarily select a point p
2. Retrieve all points density-reachable from p with regard to Eps and MinPts
3. If p is a core point, a cluster is formed
4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
5. Continue the process until all of the points have been processed
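A compact sketch of this procedure (Euclidean distance and the variable names are my choices; a real implementation would use a spatial index instead of the brute-force neighbourhood query):

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    labels = {p: None for p in points}
    cluster_id = 0

    def neighbours(p):
        return [q for q in points if math.dist(p, q) <= eps]  # N_Eps(p)

    for p in points:
        if labels[p] is not None:
            continue                       # already processed
        seeds = neighbours(p)
        if len(seeds) < min_pts:
            labels[p] = -1                 # not a core point: mark as noise
            continue                       # (may later become a border point)
        cluster_id += 1
        labels[p] = cluster_id
        while seeds:                       # expand the cluster from core points
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster_id     # noise becomes a border point
            if labels[q] is not None:
                continue                   # never expand from border points
            labels[q] = cluster_id
            q_neighbours = neighbours(q)
            if len(q_neighbours) >= min_pts:   # q is itself a core point
                seeds.extend(q_neighbours)
    return labels

pts = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0), (5.2, 5.1), (9.0, 0.0)]
print(dbscan(pts, eps=1.0, min_pts=3))
```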