
II. BASIC CONCEPTS AND DEFINITIONS

A silhouette [4] is a function that measures the similarity of an object with the objects of its own cluster compared with the objects of other clusters.

For a cluster C consisting of data points p1, p2, ..., pn, the radius r of C is defined as formula (1), where c is the centroid of C and d(pi, c) is the Euclidean distance between pi and c.

$r = \left( \frac{1}{n} \sum_{i=1}^{n} d(p_i, c)^{2} \right)^{1/2}$    (1)

$join(p_i, C_j) = \exp\left( -\, d(p_i, c_i)\, d(p_i, c_j) \,/\, r_i \right)$    (2)

where join(pi, Cj) is the intention of pi to be joined into Cj, ci and cj are the centroids of pi's own cluster Ci and of Cj respectively, and ri is the radius of pi's own cluster. The cohesion [2] of Ci and Cj is calculated as formula (3).

$Chs(C_i, C_j) = \dfrac{\sum_{p \in C_i} join(p, C_j) + \sum_{p \in C_j} join(p, C_i)}{|C_i| + |C_j|}$    (3)
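As an illustration of formulas (1)-(3), the following Python sketch computes the cluster radius, the join intention, and the cohesion between two clusters. The function and variable names are illustrative, and the exponent used in join follows formula (2) as reconstructed above.

import math

def radius(cluster, centroid):
    # Formula (1): root-mean-square distance of the cluster's points to its centroid.
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in cluster) / len(cluster))

def join(p, centroid_own, centroid_other, r_own):
    # Formula (2) as reconstructed: intention of point p (with own centroid
    # centroid_own and own-cluster radius r_own) to join the other cluster.
    return math.exp(-math.dist(p, centroid_own) * math.dist(p, centroid_other) / r_own)

def cohesion(cluster_i, cluster_j, centroid_i, centroid_j, r_i, r_j):
    # Formula (3): average join intention of Ci's points toward Cj and Cj's points toward Ci.
    total = sum(join(p, centroid_i, centroid_j, r_i) for p in cluster_i)
    total += sum(join(p, centroid_j, centroid_i, r_j) for p in cluster_j)
    return total / (len(cluster_i) + len(cluster_j))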

Definition 1 IS (Improved Silhouette)

Let S be a dataset consisting of clusters C1, C2, ..., Ct. The distance between each object oi (oi ∈ Cj, j ∈ [1, t]) and the centroid of its own cluster is denoted as ai; bi is the minimum of the distances between oi and the centroids of the other t-1 clusters. IS(oi) is defined as formula (4).

$IS(o_i) = \dfrac{b_i - a_i}{\max(a_i, b_i)}$    (4)

In formula (4), the meanings of ai and bi have changed compared with the traditional silhouette computation: both ai and bi denote distances to cluster centroids rather than average distances to the members of a cluster.

The average IS of the dataset is calculated for each candidate partition, and the partition with the maximal average IS is taken as the optimal partition of the dataset.

We take point A in Fig. 1 as an example to show the IS computation of a data point.

(1) Obtain the centroids of clusters C1, C2, C3 respectively: Centroid1 = (1.4, 1.2), Centroid2 = (4.4, 4.8), Centroid3 = (6.5, 0.8333);

(2) Calculate aA, the distance between A and the centroid of its own cluster: $a_A = \sqrt{(1-1.4)^2 + (0-1.2)^2} = 1.2649$. The distances between A and the centroids of C2 and C3 can be obtained similarly; they are 5.8822 and 5.5628 respectively. Since bA denotes the minimum of these distances according to the definition of IS, let bA = 5.5628.

(3) The IS of A can then be obtained from formula (4): IS(A) = (bA - aA)/max(aA, bA) = 0.7726.
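The worked example can be reproduced with a short Python sketch of formula (4); the function name and the explicit list of centroids are illustrative.

import math

def improved_silhouette(point, own_centroid, other_centroids):
    # IS(o) = (b - a) / max(a, b): a is the distance to the own cluster's centroid,
    # b the minimum distance to the other clusters' centroids (Definition 1).
    a = math.dist(point, own_centroid)
    b = min(math.dist(point, c) for c in other_centroids)
    return (b - a) / max(a, b)

A = (1.0, 0.0)                     # point A from Fig. 1
centroid1 = (1.4, 1.2)             # centroid of A's own cluster C1
centroid2 = (4.4, 4.8)             # centroid of C2
centroid3 = (6.5, 0.8333)          # centroid of C3
print(improved_silhouette(A, centroid1, [centroid2, centroid3]))   # ~0.7726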

Definition 2 CUCMC (Constraint-based Update of Cohesion Matrix between Clusters)

Suppose that $\{C_k\}_{k=1}^{n}$ is the set of given clusters and $X = [Chs(C_s, C_t)]_{n \times n}$ is the existing cohesion matrix between any two clusters. Let M = {(Ci, Cj)} be the set of must-link constraints, indicating that clusters Ci and Cj should be in the same class, and C = {(Ci, Cj)} be the set of cannot-link constraints, indicating that Ci and Cj should be in different classes. For clusters Cp, Cq, Cr (p, q, r ∈ [1, n]), in order to satisfy M, Chs(Cp, Cq) in X is updated to 1, and both Chs(Cp, Cr) and Chs(Cq, Cr) are updated to max(Chs(Cp, Cr), Chs(Cq, Cr)).

And Chs(Cp, Cq) in X is updated to 0 in order to satisfy C. In Fig. 2, we give an example to show the process of CUCMC, where the must-link constraint M(C1, C2) is known. From Fig. 2 we obtain Chs(C1, C2) = 0.4, Chs(C1, C3) = 0.2 and Chs(C2, C3) = 0.1 without considering the existing constraint. Since C1 and C2 need to satisfy M(C1, C2), Chs(C1, C2) in the cohesion matrix X is updated to 1, and Chs(C2, C3) is updated to max(Chs(C1, C3), Chs(C2, C3)) = 0.2, while Chs(C1, C3) remains 0.2.

For this example, CUCMC updates the cohesion matrix as follows:

$X = \begin{pmatrix} 1 & 0.4 & 0.2 \\ & 1 & 0.1 \\ & & 1 \end{pmatrix} \longrightarrow X = \begin{pmatrix} 1 & 1 & 0.2 \\ & 1 & 0.2 \\ & & 1 \end{pmatrix}$
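A minimal Python sketch of the CUCMC update, reproducing the example of Fig. 2; 0-based cluster indices and a symmetric cohesion matrix are assumed, and the function name is illustrative.

import numpy as np

def cucmc(X, must_links, cannot_links):
    # Definition 2: a must-link (p, q) sets Chs(Cp, Cq) to 1 and propagates
    # max(Chs(Cp, Cr), Chs(Cq, Cr)) to both pairs; a cannot-link sets Chs(Cp, Cq) to 0.
    X = X.copy()
    n = X.shape[0]
    for p, q in must_links:
        X[p, q] = X[q, p] = 1.0
        for r in range(n):
            if r in (p, q):
                continue
            m = max(X[p, r], X[q, r])
            X[p, r] = X[r, p] = m
            X[q, r] = X[r, q] = m
    for p, q in cannot_links:
        X[p, q] = X[q, p] = 0.0
    return X

# Example of Fig. 2: must-link between C1 and C2
X = np.array([[1.0, 0.4, 0.2],
              [0.4, 1.0, 0.1],
              [0.2, 0.1, 1.0]])
print(cucmc(X, must_links=[(0, 1)], cannot_links=[]))
# upper triangle becomes [[1, 1, 0.2], [., 1, 0.2], [., ., 1]]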

In general, the existing constraints are given in the form m = {(xi, xj)} and c = {(xi, xj)}, where m indicates that points xi and xj should be in the same cluster, and c indicates that points xi and xj should be in different clusters. The cluster-level sets M = {(Ci, Cj)} and C = {(Ci, Cj)} can be obtained by propagating these point-level constraints to the cluster level, as sketched below.
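The propagation step is not spelled out in detail in the text; one straightforward reading, sketched in Python below, is that a point-level pair induces a constraint between the sub-clusters that currently contain the two points. The helper name and the label mapping are illustrative assumptions.

def lift_constraints(point_pairs, labels):
    # Lift point-level constraints {(xi, xj)} to cluster-level constraints {(Ci, Cj)}.
    # labels maps each point index to the index of the sub-cluster containing it.
    cluster_pairs = set()
    for i, j in point_pairs:
        ci, cj = labels[i], labels[j]
        if ci != cj:
            cluster_pairs.add((min(ci, cj), max(ci, cj)))
    return cluster_pairs

# M = lift_constraints(m, labels); C = lift_constraints(c, labels)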

Penalty factors $w_{M(C_i, C_j)}$ and $w_{C(C_i, C_j)}$ are introduced in order to penalize constraint violations.

$Sim(C_i, C_j) = Chs(C_i, C_j) - w_{M(C_i, C_j)} - w_{C(C_i, C_j)}$    (5)

$w_{M(C_i, C_j)}$ takes effect when the must-link constraint (Ci, Cj) is violated, and $w_{C(C_i, C_j)}$ takes effect when the cannot-link constraint (Ci, Cj) is violated.
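A small Python sketch of formula (5). Since the extraction lost the operators, the subtraction of the penalties is an assumption, as are the function and parameter names; w_m and w_c stand for the penalty factors above.

def penalized_similarity(chs, must_violated, cannot_violated, w_m, w_c):
    # Each penalty only takes effect when the corresponding constraint is violated.
    sim = chs
    if must_violated:
        sim -= w_m
    if cannot_violated:
        sim -= w_c
    return sim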

Figure 1. The IS computation of a data point (clusters C1, C2 and C3 plotted in the x-y plane; point A(1, 0) belongs to C1).

Figure 2. The update of the cohesion matrix between clusters with constraints (cohesion values among C1, C2 and C3 before and after applying the must-link constraint between C1 and C2).


III. HIERARCHICAL CLUSTERING ALGORITHM BASED ON K-MEANS WITH CONSTRAINTS

The CSM [2] needs K to be specified. Different values of K lead to different clustering results, so determining an appropriate K becomes especially important. Besides, the existing constraints are not considered in CSM, so the accuracy of the clustering results will not be high.

In HCAKC, we plot the curve of the average IS of the dataset to be clustered against the number of partitions. The optimal number of clusters is determined by the maximum of this curve, since the average IS of a dataset reflects not only the density of the clusters but also the dissimilarity between clusters. The cohesion matrix X is constructed according to the cohesion between any two clusters. The existing pairwise constraints are incorporated into the hierarchical clustering: CUCMC is implemented based on the existing constraints. Thus, the clustering results are greatly improved. In our algorithms, S is the dataset to be clustered; K is the optimal number of clusters; n is the size of S; m is the number of sub-clusters; M = {(Ci, Cj)} (i, j ∈ [1, t]) is the set of existing must-link constraints; and C = {(Ci, Cj)} (i, j ∈ [1, t]) is the set of existing cannot-link constraints.

Algorithm Find-K
Input: S
Output: K

begin
1: partition S into t clusters C1, C2, ..., Ct according to the geometric distribution of S;
2: repeat { adjust the number of clusters and compute the average IS of S under each resulting partition };
3: return as K the number of clusters whose partition gives the maximal average IS;
end

Algorithm HCAKC
Input: S, M, C
Output: K clusters
begin
1: run Find-K on S to obtain the optimal number of clusters K;
2: choose an initial number of sub-clusters t (t > K);

3: repeat
4: { for (each point x in S)
5:     assign x to the closest sub-cluster based on the distance to its centroid;
6:   update the centroid of each sub-cluster;
7: } until (no points change between the t clusters)   // utilize K-means on S, where K equals t
8: compute the cohesion matrix X between the t clusters;
9: if ((Ci, Cj) ∈ M or (Ci, Cj) ∈ C)
10:   implement CUCMC;
11: if (Ci, Cj) violates M (or C)
12:   enforce the penalty factors wM, wC on the cohesion matrix;
13: do { extract the maximal Chs(Ci, Cj);
14:   if (Ci and Cj do not belong to the same sub-cluster)
15:     merge the two sub-clusters to which they belong into a new sub-cluster;
16:   t := t - 1; } while (t > K)
end
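To make the Find-K idea concrete, the following Python sketch scans candidate numbers of clusters, partitions the data for each candidate, and returns the candidate with the maximal average IS. The use of scikit-learn's KMeans for the partition step and all function names are illustrative assumptions, not part of the original algorithm.

import numpy as np
from sklearn.cluster import KMeans

def average_is(data, labels, centroids):
    # Average IS of the dataset under one partition (formula (4)).
    total = 0.0
    for x, lab in zip(data, labels):
        dists = np.linalg.norm(centroids - x, axis=1)
        a = dists[lab]
        b = np.min(np.delete(dists, lab))
        total += (b - a) / max(a, b)
    return total / len(data)

def find_k(data, k_min=2, k_max=10):
    # Sketch of Find-K: return the candidate number of clusters with maximal average IS.
    best_k, best_score = k_min, -np.inf
    for k in range(k_min, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
        score = average_is(data, km.labels_, km.cluster_centers_)
        if score > best_score:
            best_k, best_score = k, score
    return best_k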

In HCAKC, Find-K is first run in order to determine the optimal number of clusters K for the dataset to be clustered. Then K-means is adopted to form t clusters initially, where t is larger than K. The cohesion matrix X between the t clusters is obtained based on formula (3). Afterwards, the existing constraint sets M = {(Ci, Cj)} and C = {(Ci, Cj)} are used to implement the CUCMC. The penalty factor is introduced to address constraint violations: when a must-link (Ci, Cj) is violated, $w_{M(C_i, C_j)}$ is enforced on the similarity metric according to formula (5), and the entry in row i, column j of X is set to Sim(Ci, Cj).
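The merging phase (steps 13-16) can be sketched in Python as follows. The cohesion matrix is assumed to be symmetric and already updated by CUCMC and the penalty factors; how the cohesion of a merged sub-cluster is recomputed is not specified in the text, so the sketch reuses the max rule from CUCMC as an assumption, and all names are illustrative.

import numpy as np

def merge_to_k(chs, assignments, k):
    # Repeatedly merge the pair of sub-clusters with maximal cohesion until K clusters remain.
    # chs: symmetric cohesion matrix; assignments: sub-cluster id of each point.
    chs = chs.copy().astype(float)
    active = list(range(chs.shape[0]))
    t = len(active)
    while t > k:
        # find the most cohesive pair among the active sub-clusters
        i, j = max(((a, b) for a in active for b in active if a < b),
                   key=lambda pair: chs[pair])
        # merge j into i, keeping the stronger cohesion toward every other sub-cluster
        for r in active:
            if r not in (i, j):
                chs[i, r] = chs[r, i] = max(chs[i, r], chs[j, r])
        assignments = [i if a == j else a for a in assignments]
        active.remove(j)
        t -= 1
    return assignments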

IV. EXPERIMENTAL RESULTS

All of our experiments have been conducted on a computer with a 2.4 GHz Intel CPU and 512 MB of main memory, running Microsoft Windows XP. HCAKC is compared with CSM to evaluate its clustering quality and time performance. All algorithms are implemented in Microsoft Visual C++ 6.0.

We performed our experiments on the UCI datasets Ionosphere, Iris, breast-cancer, credit-g and page-blocks. The must-link and cannot-link constraints are generated artificially using the same method as [7]. The details of the datasets are shown in Tab. 1. For instance, D1 is the Ionosphere dataset consisting of 355 instances from two clusters. Accuracy [7], one of the clustering quality measures, is computed to compare the clustering results of HCAKC and CSM. We averaged the measures over 100 trials on each dataset. Fig. 3 and Fig. 4 show the experimental results comparing HCAKC with CSM.

HCAKC and CSM are each run on D1 (i.e., the Ionosphere dataset) with constraints. Fig. 3 shows the accuracy results comparing HCAKC with CSM on the Ionosphere dataset. From Fig. 3, we can conclude that CSM has lower accuracy than HCAKC across varying numbers of constraints.

In CSM, the constraints are not considered. In HCAKC, the constraints are incorporated into the hierarchical clustering to update the cohesion matrix, and constraint violations are addressed as well. Thus, HCAKC is better in terms of clustering accuracy.

The experiments have also been conducted on the datasets Iris, breast-cancer, credit-g and page-blocks to compare the time efficiency of HCAKC with that of CSM. From Fig. 4, we can conclude that HCAKC outperforms CSM in CPU running time on the different datasets.

The cluster number K needs to be specified as a parameter before running the CSM algorithm. The time cost of this parameter setting is expensive, since K-means needs to be run iteratively. HCAKC finds the optimal K by computing the average IS of the points in the dataset, and the time cost of this process is insignificant. The time-efficiency advantage of HCAKC is evident even when the scale of the dataset is large.

TABLE 1. PARAMETERS OF THE TESTING DATASETS

    Dataset   Name            Size   Clusters
    D1        Ionosphere       355   2
    D2        Iris             150   3
    D3        breast-cancer    277   2
    D4        credit-g        1000   2
    D5        page-blocks     5473   5

Figure 3. HCAKC and CSM comparison in terms of accuracy (accuracy (%) versus constraint ratio/size (%) on the Ionosphere dataset, for CSM and HCAKC).

Figure 4. HCAKC and CSM comparison in terms of running time (running time of HCAKC and CSM on datasets D2-D5).

V. CONCLUSION

In order to improve the time efficiency and clustering quality of CSM, a new method named HCAKC is proposed in this paper. In the proposed algorithm, the curve of the average IS of the dataset against the number of partitions is plotted, and the optimal number of clusters is determined by locating the maximum of this curve. As a result, the complexity of determining the number of clusters is reduced. Thereafter, the existing constraints are incorporated to complete the CUCMC during the hierarchical clustering process, and penalty factors are introduced to address constraint violations. Hence, the clustering quality is improved. The experimental results demonstrate that the HCAKC algorithm is effective in reducing the time complexity and increasing the clustering quality.

REFERENCES

[1] L. Sun, T. C. Lin, H. C. Huang, B. Y. Liao, and J. S. Pan, "An optimized approach on applying genetic algorithm to adaptive cluster validity index," 3rd International Conference on Intelligent Information Hiding and Multimedia Signal Processing, Kaohsiung, Taiwan, Nov. 2007, vol. 2, pp. 582-585.

[2] C. R. Lin and M. S. Chen, "Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging," IEEE Transactions on Knowledge and Data Engineering, 2005, 17(2): 145-159.

[3] H. J. Sun, S. R. Wang, and Q. S. Jiang, "FCM-based model selection algorithms for determining the number of clusters," Pattern Recognition, 2004, vol. 37(10), pp. 2027-2037.

[4] S. Lamrous and M. Taileb, "Divisive hierarchical K-means," CIMCA 2006: International Conference on Computational Intelligence for Modeling, Control and Automation, jointly with IAWTIC 2006: International Conference on Intelligent Agents, Web Technologies and Internet Commerce, Sydney, NSW, Australia, 2006, pp. 18-23.

[5] S. C. Chu, J. F. Roddick, C. J. Su, and J. S. Pan, "Constrained ant colony optimization for data clustering," 8th Pacific Rim International Conference on Artificial Intelligence, PRICAI 2004: Trends in Artificial Intelligence, Auckland, New Zealand, 2004, vol. 3157, pp. 534-543.

[6] C. Ruiz, M. Spiliopoulou, and E. Menasalvas, "C-DBSCAN: Density-based clustering with constraints," 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, Toronto, Canada, 2007, pp. 216-223.

[7] K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl, "Constrained K-means clustering with background knowledge," Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 577-584.

[8] I. Davidson and S. S. Ravi, "Agglomerative hierarchical clustering with constraints: Theoretical and empirical results," 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, Porto, Portugal, 2005, pp. 59-70.

[9] M. Bilenko, S. Basu, and R. J. Mooney, "Integrating constraints and metric learning in semi-supervised clustering," Proceedings of the 21st International Conference on Machine Learning, New York, ACM Press, 2004, pp. 81-88.

