2001/12/18 CHAMELEON 1
CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
Paper presentation in data mining class
Presenter: 許明壽, 蘇建仲
Date: 2001/12/18
About this paper …
Department of Computer Science and Engineering, University of Minnesota: George Karypis, Eui-Hong (Sam) Han, Vipin Kumar
IEEE Computer Journal - Aug. 1999
Outline
Problem definition
Main algorithm
Key features of CHAMELEON
Experiments and related work
Conclusion and discussion
Problem definition
Clustering: intracluster similarity is maximized, intercluster similarity is minimized
Problems of existing clustering algorithms: a static model constraint makes them break down when clusters are of diverse shapes, densities, and sizes, and leaves them susceptible to noise, outliers, and artifacts
Static model constraints
Data space constraint: K-means, PAM, etc. are suitable only for data in metric spaces
Cluster shape constraint: K-means, PAM, and CLARANS assume clusters are ellipsoidal or globular and of similar sizes
Cluster density constraint: DBSCAN assumes points within a genuine cluster are density-reachable and points across different clusters are not
Similarity-determination constraint: CURE and ROCK use a static model to determine the most similar clusters to merge
Problems with partitioning techniques
(a) Clusters of widely different sizes (b) Clusters with convex shapes
Hierarchical technique problem (1/2): clusters {(c), (d)} will be chosen to merge when we only consider closeness
Hierarchical technique problem (2/2): clusters {(a), (c)} will be chosen to merge when we only consider inter-connectivity
Main algorithm
Two-phase algorithm
PHASE I: use a graph-partitioning algorithm to cluster the data items into a large number of relatively small sub-clusters
PHASE II: use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
Framework
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
Key features of CHAMELEON
Modeling the data
Modeling the cluster similarity
Partition algorithm
Merge schemes
Terms
Arguments needed:
K: for the k-nearest-neighbor graph
MINSIZE: the minimum size of an initial cluster
T_RI: threshold on relative inter-connectivity
T_RC: threshold on relative closeness
α: coefficient weighting RI against RC
Modeling the data
K-nearest-neighbor graph approach
Advantages:
Data points that are far apart are completely disconnected in G_k
G_k captures the concept of neighborhood dynamically
Edge weights in dense regions of G_k tend to be large, and edge weights in sparse regions tend to be small
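A minimal sketch of how such a graph could be built (the function name `knn_graph` and the inverse-distance edge weighting are illustrative assumptions, not the paper's implementation):

```python
def knn_graph(points, k, dist):
    """Build a k-nearest-neighbor graph G_k.

    Returns a dict mapping each undirected edge (i, j), i < j, to a
    weight that decreases with distance, so edges inside dense regions
    get large weights and edges in sparse regions get small weights.
    """
    edges = {}
    n = len(points)
    for i in range(n):
        # distances from point i to every other point
        d = sorted((dist(points[i], points[j]), j) for j in range(n) if j != i)
        for dij, j in d[:k]:  # connect i to its k nearest neighbors
            key = (min(i, j), max(i, j))
            edges[key] = 1.0 / (1.0 + dij)  # similarity-style weight (assumption)
    return edges

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```

Note that far-apart points such as outliers simply never appear among each other's k nearest neighbors, so they stay disconnected, as the slide claims.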
Example of k-nearest neighbor graph
Modeling the cluster similarity (1/2)

Relative interconnectivity: the absolute interconnectivity between C_i and C_j (the sum of the weights of the edges connecting the two clusters, |EC_{C_i,C_j}|), normalized by their internal interconnectivities (the weights of their min-cut bisectors, |EC_{C_i}| and |EC_{C_j}|):

    RI(C_i, C_j) = |EC_{C_i,C_j}| / ( (|EC_{C_i}| + |EC_{C_j}|) / 2 )

Relative closeness: the average weight S̄_{EC_{C_i,C_j}} of the edges connecting the two clusters, normalized by the size-weighted average of the internal closenesses S̄_{EC_{C_i}} and S̄_{EC_{C_j}}:

    RC(C_i, C_j) = S̄_{EC_{C_i,C_j}} / ( |C_i|/(|C_i|+|C_j|) · S̄_{EC_{C_i}} + |C_j|/(|C_i|+|C_j|) · S̄_{EC_{C_j}} )
Modeling the cluster similarity (2/2)
If relative closeness is considered, {(c), (d)} will be merged
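Once the edge-cut statistics of the sub-clusters are known, the two measures above reduce to simple arithmetic; a small sketch (the function names and argument layout are our own, not the paper's API):

```python
def rel_interconnectivity(ec_between, ec_i, ec_j):
    """RI(Ci, Cj): absolute interconnectivity |EC_{Ci,Cj}| normalized by
    the average of the internal interconnectivities (min-cut bisector
    weights |EC_Ci| and |EC_Cj|)."""
    return ec_between / ((ec_i + ec_j) / 2.0)

def rel_closeness(s_between, s_i, s_j, n_i, n_j):
    """RC(Ci, Cj): average weight of the connecting edges, normalized by
    the size-weighted average of the two internal closenesses, where
    n_i and n_j are the cluster sizes |Ci| and |Cj|."""
    w = n_i / (n_i + n_j)
    return s_between / (w * s_i + (1.0 - w) * s_j)
```

Values near 1 mean the pair looks as interconnected (or as close) as the clusters are internally, which is exactly what favors merging.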
Partition algorithm (PHASE I)
What: finding the initial sub-clusters
Why: RI and RC cannot be accurately calculated for clusters containing only a few data points
How: utilize a multilevel graph-partitioning algorithm (hMETIS), with a coarsening phase, a partitioning phase, and an uncoarsening phase
Partition algorithm (cont.)
Initially, all points belong to the same cluster
Repeat until every cluster has fewer than MINSIZE points: select the largest cluster and use hMETIS to bisect it
Balance constraint: split C_i into C_iA and C_iB such that each sub-cluster contains at least 25% of the nodes of C_i
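The loop above can be sketched as follows; the bisector is a stand-in for hMETIS, and `phase_one`/`naive_bisect` are illustrative names:

```python
def phase_one(n_points, min_size, bisect):
    """Sketch of CHAMELEON Phase I: repeatedly bisect the largest
    sub-cluster until every sub-cluster has fewer than min_size points."""
    clusters = [list(range(n_points))]    # initially one cluster with all points
    while max(len(c) for c in clusters) >= min_size:
        largest = max(clusters, key=len)  # select the largest cluster
        clusters.remove(largest)
        a, b = bisect(largest)            # hMETIS min-cut bisection in the paper
        clusters.extend([a, b])
    return clusters

def naive_bisect(cluster):
    """Placeholder bisector: an even split. The real algorithm minimizes
    the edge cut of the k-NN graph subject to the balance constraint
    (each half keeps at least 25% of the nodes)."""
    mid = len(cluster) // 2
    return cluster[:mid], cluster[mid:]
```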
Merge schemes (PHASE II)
What: merging sub-clusters using a dynamic framework
How: finding and merging the pair of sub-clusters that are the most similar
Scheme 1: merge the pair maximizing RI(C_i, C_j) · RC(C_i, C_j)^α
Scheme 2: merge pairs with RI(C_i, C_j) ≥ T_RI and RC(C_i, C_j) ≥ T_RC
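Both schemes are straightforward once RI and RC are available per candidate pair; a hedged sketch (the pair-dictionary layout and function names are our assumptions):

```python
def scheme1_pick(pairs, alpha):
    """Scheme 1: pick the pair maximizing RI * RC**alpha.
    `pairs` maps (i, j) -> (RI, RC); alpha > 1 gives more weight to
    relative closeness, alpha < 1 to relative inter-connectivity."""
    return max(pairs, key=lambda p: pairs[p][0] * pairs[p][1] ** alpha)

def scheme2_candidates(pairs, t_ri, t_rc):
    """Scheme 2: keep only pairs whose RI and RC both clear the thresholds."""
    return [p for p, (ri, rc) in pairs.items() if ri >= t_ri and rc >= t_rc]
```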
Experiments and related work
Introduction of CURE
Introduction of DBSCAN
Results of experiments
Performance analysis
Introduction of CURE (1/n)
Clustering Using Representative points
1. Properties:
Fits non-spherical shapes; multiple representative points are chosen for non-spherical clusters
Shrinking the representative points helps dampen the effects of outliers
At each merge step, the representative points (chosen from well-scattered points) are shrunk toward the cluster centroid by a fixed ratio
Random sampling of the data set makes it fit for large databases
Introduction of CURE (2/n)
2. Drawbacks: the partitioning method cannot guarantee that the data points chosen are good
Clustering accuracy with respect to the parameters below:
(1) Shrink factor s: CURE always finds the right clusters for s values from 0.2 to 0.7
(2) Number of representative points c: CURE always found the right clusters for values of c greater than 10
(3) Number of partitions p: with as many as 50 partitions, CURE always discovered the desired clusters
(4) Random sample size r: (a) for sample sizes up to 2000, the clusters found were of poor quality; (b) from 2500 sample points and above (about 2.5% of the data set size), CURE always correctly found the clusters
3. Clustering algorithm : Representative points
• Merge procedure
Introduction of DBSCAN (1/n)
Density-Based Spatial Clustering of Applications with Noise
1. Properties:
Can discover clusters of arbitrary shape
Each cluster has a typical density of points that is higher than outside the cluster
The density within the areas of noise is lower than the density in any of the clusters
Inputs only the parameter MinPts
Easy to implement in C++ using an R*-tree
Runtime is linear in the number of region queries; overall time complexity is O(n log n)
Introduction of DBSCAN (2/n)
2. Drawbacks:
Cannot apply to polygons
Cannot apply to high-dimensional feature spaces
Cannot process the shape of the k-dist graph with multiple features
Cannot fit large databases, because no method is applied to reduce the spatial database
3. Definitions
Eps-neighborhood of a point p: N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
Each cluster contains at least MinPts points
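The Eps-neighborhood and the core-point condition can be written directly from these definitions; in this sketch a linear scan stands in for the R*-tree region query, and the function names are our own:

```python
def eps_neighborhood(D, p, eps, dist):
    """N_Eps(p) = { q in D | dist(p, q) <= Eps } (DBSCAN definition).
    A real implementation answers this region query with an R*-tree;
    a linear scan is enough to show the definition."""
    return [q for q in D if dist(p, q) <= eps]

def is_core_point(D, p, eps, min_pts, dist):
    """Core point condition: |N_Eps(p)| >= MinPts (p counts itself)."""
    return len(eps_neighborhood(D, p, eps, dist)) >= min_pts

def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
```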
Introduction of DBSCAN (3/n)
4. p is directly density-reachable from q if
(1) p ∈ N_Eps(q), and
(2) |N_Eps(q)| ≥ MinPts (core point condition)
Directly density-reachable is symmetric when p and q are both core points; otherwise it is asymmetric (one core point and one border point)
5. p is density-reachable from q if there is a chain of directly density-reachable points between p and q
Density-reachable is transitive but not symmetric in general; it is symmetric for core points
Introduction of DBSCAN (4/n)
6. A point p is density-connected to a point q if there is a point s such that both p and q are density-reachable from s
Density-connected is a symmetric and reflexive relation
A cluster is defined as a maximal set of density-connected points with respect to density-reachability
Noise is the set of points not belonging to any cluster
7. How to find a cluster C?
Maximality: ∀p, q: if p ∈ C and q is density-reachable from p, then q ∈ C
Connectivity: ∀p, q ∈ C: p is density-connected to q
8. How to find noise?
∀p: if p does not belong to any cluster, then p is a noise point
Results of experiments
Performance analysis (1/2)
Time to construct the k-nearest-neighbor graph:
Low-dimensional data sets: based on k-d trees, overall complexity O(n log n)
High-dimensional data sets: k-d trees are not applicable, overall complexity O(n²)
Finding initial sub-clusters:
Obtains m clusters by repeatedly partitioning successively smaller graphs; overall computational complexity is O(n log(n/m)), which is bounded by O(n log n)
A faster partitioning algorithm can obtain the initial m clusters in time O(n + m log m) using a multilevel m-way partitioning algorithm
Performance analysis (2/2)
Merging sub-clusters using a dynamic framework:
The time to compute the internal inter-connectivity and internal closeness of each initial cluster is O(nm)
The time to find the most similar pair of clusters to merge is O(m² log m), using a heap-based priority queue
So the overall complexity of CHAMELEON is O(n log n + nm + m² log m)
Conclusion and discussion
Dynamic model based on relative interconnectivity and relative closeness
This paper ignores the issue of scaling to large data sets
Other graph representation methodologies? Other partitioning algorithms?