RESEARCH AND IMPLEMENTATION OF SEVERAL CLUSTERING AND CLASSIFICATION ALGORITHMS


  • 8/2/2019 Research and Implementation of Several Clustering and Classification Algorithms


(CF, Clustering Feature) and the CF tree (Clustering Feature tree), using the CF tree to represent cluster summaries in order to achieve good speed and clustering scalability in large databases. It is also well suited to incremental and dynamic clustering of incoming data points.

A clustering feature (CF) is a triple that summarizes information about a subcluster of points. Given $N$ points (vectors) $\{\vec{X}_i\}$ in a subcluster, the CF is defined as follows:

$$CF = (N, \vec{LS}, SS) \qquad (3.23)$$

where $N$ is the number of points in the subcluster, $\vec{LS} = \sum_{i=1}^{N} \vec{X}_i$ is the linear sum of the $N$ points, and $SS = \sum_{i=1}^{N} \vec{X}_i^{2}$ is the square sum of the data points.

A CF tree is a height-balanced tree that stores clustering features. It has two parameters: the branching factor B and the threshold T. The branching factor specifies the maximum number of children per node. The threshold parameter specifies the maximum diameter of the subclusters stored at the leaf nodes. Changing the threshold value changes the size of the tree. Nonleaf nodes store the sums of the CFs of their children, and thus summarize the information about their children.
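The additivity that lets a nonleaf node store the sum of its children's CFs can be sketched as follows. This is a minimal Python illustration, not the thesis's implementation: the function names are hypothetical, SS is taken to be the scalar sum of squared norms, and the diameter formula (root mean squared pairwise distance, as commonly used with BIRCH) is an assumption, since the text does not state how the diameter is computed.

```python
import math

def make_cf(points):
    """Build the clustering feature (N, LS, SS) of a list of d-dimensional points."""
    n = len(points)
    dim = len(points[0])
    ls = [sum(p[k] for p in points) for k in range(dim)]  # linear sum (a vector)
    ss = sum(x * x for p in points for x in p)            # square sum (a scalar)
    return n, ls, ss

def merge_cf(cf_a, cf_b):
    """CFs are additive: the CF of a union is the component-wise sum."""
    n_a, ls_a, ss_a = cf_a
    n_b, ls_b, ss_b = cf_b
    return n_a + n_b, [a + b for a, b in zip(ls_a, ls_b)], ss_a + ss_b

def centroid(cf):
    """Centroid recovered from the summary alone: LS / N."""
    n, ls, _ = cf
    return [x / n for x in ls]

def diameter(cf):
    """Assumed diameter: sqrt((2*N*SS - 2*|LS|^2) / (N*(N-1)))."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    ls_sq = sum(x * x for x in ls)
    return math.sqrt(max(0.0, (2 * n * ss - 2 * ls_sq) / (n * (n - 1))))
```

Because only N, LS and SS are needed, a leaf entry can test whether absorbing a new point would push its diameter above T without revisiting the points already absorbed.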

The BIRCH algorithm has the following two phases:

Phase 1: Scan the database to build an initial in-memory CF tree, which can be viewed as a multilevel compression of the data that tries to preserve the inherent clustering structure of the data.

Phase 2: Apply a (selected) clustering algorithm to cluster the leaf nodes of the CF tree.

In phase 1, the CF tree is built dynamically as data points are inserted. The method is therefore incremental. A point is inserted into the closest leaf entry (subcluster). If the diameter of the subcluster stored in the leaf node after insertion is larger than the threshold value, the leaf node and possibly other nodes are split. After a new point is inserted, the information about it is passed up toward the root of the tree. The size of the CF tree can be changed by modifying the threshold. If the memory needed to store the CF tree exceeds the size of main memory, a larger threshold value is specified (a larger threshold lets each leaf entry absorb more points, so the tree becomes smaller) and the CF tree is rebuilt. The rebuilding process constructs a new tree from the leaf nodes of the old tree, so the tree is rebuilt without rereading all of the points. Consequently, the data has to be read only once to build the tree. Several heuristics and methods have also been introduced to deal with outliers and to improve the quality of the CF tree through additional scans of the data.

After the CF tree is built, any clustering algorithm, for example a typical partitioning algorithm, can be used with the CF tree in phase 2.

BIRCH tries to produce the best clusters with the available resources. With a limited amount of main memory, an important consideration is to minimize the time required for I/O. It applies a multiphase clustering technique: a single scan of the data set yields a basic good clustering, and one or more additional (optional) scans are used to further improve the quality. The computational complexity of the algorithm is therefore O(N), where N is the number of objects to be clustered.

Experiments have shown the linear scalability of the algorithm with respect to the number of points, and good quality of the resulting clustering. However, because each node in a CF tree can hold only a limited number of entries due to its size, a CF tree node does not always correspond to a natural cluster. Moreover, if the clusters are not spherical, BIRCH does not perform well, because it uses the notion of radius or diameter to control the boundary of a cluster.

3.5.3 CURE: Clustering using representatives

Most clustering algorithms either favor clusters with spherical shape and similar sizes, or are very fragile in the presence of outliers. An interesting method called CURE (Clustering Using REpresentatives) (Guha, Rastogi and Shim 1998) integrates partitioning and hierarchical algorithms and overcomes the problem of favoring clusters with spherical shape and similar sizes.

CURE provides a novel hierarchical clustering algorithm that takes a middle ground between the centroid-based and the all-points approaches. Instead of using a single centroid to represent a cluster, CURE fixes a number of representative points that are chosen to describe the cluster. These representative points are generated by first selecting well-scattered points in the cluster and then shrinking them toward the cluster center by a fraction (the shrinking factor). At each step of the algorithm, the two clusters with the closest pair of representative points are merged.

Having more than one representative point per cluster allows CURE to adjust well to the geometry of nonspherical shapes. Shrinking helps dampen the effect of outliers. CURE is therefore more robust to outliers and identifies clusters that are nonspherical and vary widely in size.
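The selection and shrinking of representative points described above can be sketched as follows. This is a hedged illustration only: the function names, the farthest-point heuristic used to obtain "well-scattered" points, and the parameter names c and alpha are assumptions, not details given in the text.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    dim = len(points[0])
    return tuple(sum(p[k] for p in points) / len(points) for k in range(dim))

def scattered_points(points, c):
    """Pick up to c well-scattered points (assumed farthest-point heuristic)."""
    center = centroid(points)
    points = list(dict.fromkeys(points))  # drop duplicates, keep order
    chosen = [max(points, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(c, len(points)):
        rest = [p for p in points if p not in chosen]
        # Next representative: the point farthest from those already chosen.
        chosen.append(max(rest, key=lambda p: min(math.dist(p, q) for q in chosen)))
    return chosen

def shrink(reps, center, alpha):
    """Move each representative toward the center by the fraction alpha."""
    return [tuple(r + alpha * (c - r) for r, c in zip(rep, center)) for rep in reps]

def cluster_distance(reps_a, reps_b):
    """Distance between clusters: the closest pair of representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)
```

With alpha = 0 the representatives stay on the cluster boundary (all-points behaviour), and with alpha = 1 they collapse onto the centroid, which is the middle ground the text describes.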

To handle large databases, CURE uses a combination of random sampling and partitioning: a random sample is first partitioned, and each partition is partially clustered. The partial clusters are then clustered in a second pass to obtain the desired clusters.

The main steps of the CURE algorithm are outlined as follows:
(1) Draw a random sample s;
(2) Partition the sample s into p partitions, each of size s/p;
(3) Partially cluster each partition into s/pq clusters, with q > 1;
(4) Eliminate outliers by random sampling: if a cluster grows too slowly, remove it;
(5) Cluster the partial clusters, a process that shrinks the multiple representative points toward the centroid by a fraction α specified by the user, so that the representatives capture the shape of the cluster;
(6) Mark the data with the corresponding cluster labels.

The following example shows how CURE works.

Example 3.5: Suppose a set of objects is located in a rectangle. Let p = 2, and suppose the user needs to cluster the objects into two clusters.

Figure 3.6: Clustering a set of points with CURE

First, 50 objects are sampled, as shown in Figure 3.6a). These objects are then initially divided into two partitions, each containing 25 points (s/p = 50/2). We partially cluster these partitions into 10 subclusters based on the minimum mean distance. Each cluster representative is marked by a small cross, as in Figure 3.6b). These representatives are moved toward the centroid by a fraction α, as in Figure 3.6c). This captures the shape of the clusters, and 2 clusters are formed. The objects are thus partitioned into two clusters with the outliers removed, as shown in Figure 3.6d).

CURE produces high-quality clusters in the presence of outliers, complex cluster shapes, and different cluster sizes. It scales well to large databases without sacrificing clustering quality. CURE needs a few user-specified parameters, such as the size of the random sample, the desired number of clusters, and the shrinking factor α. A sensitivity analysis of the clustering was provided, based on the results of varying the parameters. Although many of the parameters can be varied without affecting the quality of the clustering, the parameter settings in general do have a significant influence.

Another agglomerative hierarchical clustering algorithm, developed by Guha, Rastogi and Shim (1999) and called ROCK, is suited to clustering categorical attributes. It measures the similarity of 2 clusters by comparing the aggregate interconnectivity of the 2 clusters against a user-specified static interconnectivity model, where the interconnectivity of two clusters C1 and C2 is defined by the number of cross links between the two clusters, and link(p_i, p_j) is the number of common neighbors of the two points p_i and p_j.
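The link count and the cross-link interconnectivity can be sketched as below. This is an illustrative sketch under stated assumptions: points are sets of categorical values, neighbors are determined with the Jaccard coefficient and a threshold theta, and a point is not counted as its own neighbor. None of these choices is spelled out in the text, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets of categorical values (assumed measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def neighbors(points, theta):
    """neighbors[i] = indices j != i whose similarity to point i is >= theta."""
    n = len(points)
    return [{j for j in range(n) if j != i and jaccard(points[i], points[j]) >= theta}
            for i in range(n)]

def link(nbrs, i, j):
    """link(p_i, p_j): number of common neighbors of points i and j."""
    return len(nbrs[i] & nbrs[j])

def cross_links(nbrs, cluster_a, cluster_b):
    """Interconnectivity of two clusters: total links across the cluster pair."""
    return sum(link(nbrs, i, j) for i in cluster_a for j in cluster_b)
```

Two points can have a high link count even when their direct similarity is modest, which is why links are more robust than pairwise similarity for categorical data.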

ROCK first constructs a sparse graph from a given data similarity matrix, using a similarity threshold and the concept of shared neighbors, and then runs a hierarchical clustering algorithm on the sparse graph.

3.5.4 CHAMELEON: A hierarchical clustering algorithm using dynamic modeling

Another interesting clustering algorithm, called CHAMELEON, explores dynamic modeling in hierarchical clustering and was developed by Karypis, Han and Kumar (1999). In its clustering process, 2 clusters are merged if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity and closeness of the objects within the clusters. The merge process based on the dynamic model facilitates the discovery of natural and homogeneous clusters, and it applies to all types of data as long as a similarity function is specified.

CHAMELEON is derived from the observed weaknesses of two hierarchical clustering algorithms: CURE and ROCK. CURE and related schemes ignore information about the aggregate interconnectivity of the objects in 2 clusters; conversely, ROCK and related schemes ignore information about the closeness of 2 clusters while emphasizing their interconnectivity.

CHAMELEON first uses a graph partitioning algorithm to cluster the data items into a large number of relatively small subclusters. It then uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these subclusters. To determine the pairs of most similar subclusters, it takes into account both the interconnectivity and the closeness of the clusters, and in particular the internal characteristics of the clusters themselves. It therefore does not depend on a static, user-supplied model and can adapt automatically to the internal characteristics of the clusters being merged.


Figure 3.7: CHAMELEON: Hierarchical clustering based on k-nearest neighbors and dynamic modeling

As shown in Figure 3.7, CHAMELEON represents its objects using a commonly used graph approach: the k-nearest neighbor graph. Each vertex of the k-nearest neighbor graph represents a data object, and there is an edge between two vertices (objects) if one object is among the k most similar objects of the other. The k-nearest neighbor graph G_k captures the concept of neighborhood dynamically: the neighborhood radius of a data point is determined by the density of the region in which the object resides. In a dense region the neighborhood is defined narrowly, and in a sparse region it is defined more widely. Compared with the model defined by density-based methods such as DBSCAN (introduced in a later section), which uses a global neighborhood density, G_k captures a more natural neighborhood. Moreover, the density of a region is recorded as the weight of the edges: the edges of a dense region tend to weigh more than those of a sparse region.
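A k-nearest neighbor graph of this kind can be built as in the sketch below. It is a minimal illustration, not CHAMELEON's actual implementation: the function name is hypothetical, Euclidean distance is assumed, and the edge weight 1 / (1 + distance) is an assumed similarity chosen only so that closer pairs weigh more.

```python
import math

def knn_graph(points, k):
    """Return {(i, j): weight} with i < j for the k-nearest-neighbor graph."""
    edges = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        for j in others[:k]:
            # An edge exists if j is among the k nearest neighbors of i
            # (or vice versa, when the loop reaches j).
            edges[(min(i, j), max(i, j))] = 1.0 / (1.0 + math.dist(p, points[j]))
    return edges
```

In a dense region the k nearest neighbors are close, so the edges are heavy; in a sparse region the same k neighbors are farther away and the edges are lighter, which matches the dynamic notion of neighborhood described above.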

CHAMELEON determines the similarity between each pair of clusters C_i and C_j according to their relative interconnectivity RI(C_i, C_j) and their relative closeness RC(C_i, C_j).

The relative interconnectivity RI(C_i, C_j) between two clusters C_i and C_j is defined as the absolute interconnectivity between C_i and C_j, normalized with respect to the internal interconnectivity of the two clusters C_i and C_j. That is:

$$RI(C_i, C_j) = \frac{\left|EC_{\{C_i, C_j\}}\right|}{\frac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)} \qquad (3.24)$$

where $EC_{\{C_i, C_j\}}$ is the edge cut of the cluster containing both C_i and C_j, such that this cluster is broken into C_i and C_j, and similarly $EC_{C_i}$ (or $EC_{C_j}$) is the size of the min-cut bisector (that is, the weighted sum of the edges that partition the graph into two roughly equal parts).

The relative closeness between a pair of clusters C_i and C_j, RC(C_i, C_j), is defined as the absolute closeness between C_i and C_j, normalized with respect to the internal closeness of the two clusters C_i and C_j. That is:

$$RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}} + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}} \qquad (3.25)$$

where $\bar{S}_{EC_{\{C_i, C_j\}}}$ is the average weight of the edges that connect vertices in C_i to vertices in C_j, and $\bar{S}_{EC_{C_i}}$ (or $\bar{S}_{EC_{C_j}}$) is the average weight of the edges that belong to the min-cut bisector of cluster C_i (or C_j).

CHAMELEON is thus better able than DBSCAN and CURE to discover arbitrarily shaped clusters of high quality. However, the processing cost for high-dimensional data may be O(n^2) time for n objects in the worst case.

3.6 Density-based clustering methods

To discover clusters of arbitrary shape, density-based clustering methods have been developed. They connect regions of high density into clusters, or cluster the objects based on density function distributions.

3.6.1 DBSCAN: A density-based clustering method based on connected regions with high density

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm developed by Ester, Kriegel, Sander and Xu (1996). The algorithm grows regions of high density into clusters and discovers clusters of arbitrary shape in spatial databases with noise. A cluster is defined as a maximal set of density-connected points.

The basic idea of density-based clustering is as follows: for each object of a cluster, the neighborhood within a given radius (ε) (called the ε-neighborhood) must contain at least a minimum number of objects (MinPts).

An object whose neighborhood within a given radius (ε) contains no fewer than a minimum number of neighboring objects (MinPts) is called a core object (with respect to the radius ε and the minimum number of points MinPts).

An object p is directly density-reachable from an object q with respect to the radius ε and the minimum number of points MinPts in a set of objects D if p is within the ε-neighborhood of q and the ε-neighborhood of q contains at least MinPts points.

An object p is density-reachable from an object q with respect to ε and MinPts in a set of objects D if there is a chain of objects p_1, p_2, ..., p_n with p_1 = q and p_n = p such that p_i ∈ D for 1 ≤ i ≤ n and p_{i+1} is directly density-reachable from p_i with respect to ε and MinPts for 1 ≤ i < n.

An object p is density-connected to an object q with respect to ε and MinPts in a set of objects D if there is an object o ∈ D such that both p and q are density-reachable from o with respect to ε and MinPts.

Example 3.6: In Figure 3.8, ε is represented by the radius of the circles and MinPts = 3. M is directly density-reachable from P; Q is (indirectly) density-reachable from P. However, P is not density-reachable from Q. Similarly, R and S are density-reachable from O; and O, R and S are all density-connected.

Figure 3.8: Density reachability and density connectivity in density-based clustering
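The definitions of ε-neighborhood, core object and density reachability can be made concrete with a short sketch. The function names are hypothetical, Euclidean distance is assumed, and the ε-neighborhood is assumed to include the object itself; this is an illustration of the definitions rather than a full DBSCAN implementation.

```python
import math
from collections import deque

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of point i (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    """A core object has at least min_pts objects in its eps-neighborhood."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

def density_reachable(points, start, eps, min_pts):
    """Indices density-reachable from `start`; empty set if start is not core."""
    if not is_core(points, start, eps, min_pts):
        return set()
    reached, queue = {start}, deque([start])
    while queue:
        q = queue.popleft()
        if not is_core(points, q, eps, min_pts):
            continue  # border objects are reachable but do not extend the chain
        for j in eps_neighborhood(points, q, eps):
            if j not in reached:
                reached.add(j)
                queue.append(j)
    return reached
```

For the points (0,0), (1,0), (2,0), (3,0), (10,0) with ε = 1 and MinPts = 3, the border point at index 0 is density-reachable from the core point at index 1, but nothing is density-reachable from index 0, which illustrates the asymmetry noted in the example.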

Note that density reachability is the transitive closure of direct density reachability, and that this relationship is asymmetric. Only core objects are mutually density-reachable. Density connectivity, by contrast, is a symmetric relation.

A density-based cluster is a set of density-connected objects that is maximal with respect to density reachability; every object not contained in any cluster is noise.

Based on the notion of density reachability, the density-based clustering algorithm DBSCAN was developed to cluster data in databases. It checks the ε-neighborhood of each point in the database. If the ε-neighborhood of a point p contains at least MinPts points, a new cluster with p as a core object is created. It then iteratively collects the objects that are directly density-reachable from these core objects, which may involve merging a few density-reachable clusters. The process terminates when no new point can be added to any cluster.

3.6.2 OPTICS: Ordering points to identify the clustering structure

Although the density-based clustering algorithm DBSCAN can find clusters of objects given input parameters such as ε and MinPts, the user is still responsible for selecting parameter values that lead to accurate clusters. In fact, this is a problem shared by many other clustering algorithms. Such parameter settings are usually hard to determine, especially for real-world, high-dimensional data sets. Most algorithms are very sensitive to the parameter values: slightly different settings can lead to very different partitionings of the data. Moreover, real high-dimensional data sets often have very skewed distributions, and there may not even exist a global parameter setting for which the result of a clustering algorithm describes the intrinsic clustering structure accurately.

To overcome this difficulty, a cluster ordering method called OPTICS (Ordering Points To Identify the Clustering Structure) was developed by Ankerst, Breunig, Kriegel and Sander (1999). It computes an augmented cluster ordering for automatic and interactive cluster analysis. This cluster ordering contains information equivalent to the density-based clusterings obtained from a broad range of parameter settings.

By examining the density-based clustering algorithm DBSCAN, one can easily see that, for a constant MinPts value, density-based clusters with respect to a higher density (that is, a lower value of ε) are completely contained in density-connected sets obtained with respect to a lower density. Therefore, to produce density-based clusters for a set of distance parameters, the algorithm needs to select the objects for processing in a specific order, so that the object that is density-reachable with respect to the lowest ε value is finished first.

Based on this idea, two values need to be stored for each object: the core-distance and the reachability-distance.

The core-distance of an object p is the smallest distance ε' between p and an object in its ε-neighborhood such that p would be a core object with respect to ε', provided that this neighbor lies within the ε-neighborhood of p. If p is not a core object with respect to ε, its core-distance is undefined.

The reachability-distance of an object p with respect to another object o is the smallest distance such that p is directly density-reachable from o, provided that o is a core object. If o is not a core object, even at the generating distance ε, the reachability-distance of p with respect to o is undefined.

The OPTICS algorithm creates an ordering of the objects in a database and additionally stores the core-distance and a suitable reachability-distance for each object. This information is sufficient to extract, from the ordering, all density-based clusterings with respect to any distance ε' that is smaller than the generating distance ε.
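The two stored values can be computed as in the following sketch. It is illustrative only: the function names are hypothetical, Euclidean distance is assumed, an object is assumed to count toward its own neighborhood, and `None` stands in for "undefined". In OPTICS the reachability-distance is only evaluated for objects p inside the ε-neighborhood of o; the sketch leaves that check to the caller.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Distance from p to its min_pts-th nearest object (counting p itself),
    or None when p is not a core object within eps."""
    dists = sorted(math.dist(points[p], q) for q in points)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None
    return dists[min_pts - 1]

def reachability_distance(points, p, o, eps, min_pts):
    """max(core-distance(o), d(o, p)), or None when o is not a core object."""
    cd = core_distance(points, o, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[o], points[p]))
```

Taking the maximum with the core-distance means that all objects very close to a core object share the same reachability-distance, which smooths the reachability plot inside dense regions.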

The cluster ordering of a data set can be represented and understood graphically. For example, Figure 3.9 is a reachability plot for a simple two-dimensional data set, which gives a general overview of how the data are structured and clustered. Methods have also been developed for viewing the clustering structure of high-dimensional data.

Figure 3.9: Cluster ordering in OPTICS

Because the OPTICS algorithm is structurally equivalent to DBSCAN, OPTICS has the same run-time complexity as DBSCAN. Structures such as spatial indexes can be used to further improve its performance.

3.6.3 DENCLUE: Clustering based on density distribution functions

DENCLUE (DENsity-based CLUstEring) (Hinneburg and Keim 1998) is a clustering method based on a set of density distribution functions.

The method is built on the following ideas: (1) the influence of each data point can be formally modeled using a mathematical function, called an influence function, which describes the impact of a data point within its neighborhood; (2) the overall density of the data space can be modeled analytically as the sum of the influence functions of all data points; (3) clusters can then be determined mathematically by identifying density attractors, where density attractors are local maxima of the overall density function.

The influence function of a data point $y \in F^d$, where $F^d$ is a d-dimensional feature space, is a basic function $f_B^y : F^d \to R_0^+$, defined in terms of a basic influence function $f_B$:

$$f_B^y(x) = f_B(x, y) \qquad (3.26)$$

In principle, the influence function can be an arbitrary function, but it should be reflexive and symmetric. It can be defined from a distance function such as the Euclidean distance $d(x, y)$, for example as a square wave influence function:

$$f_{Square}(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \sigma \\ 1 & \text{otherwise} \end{cases} \qquad (3.27)$$

or as a Gaussian influence function:

$$f_{Gauss}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}} \qquad (3.28)$$

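Figure 3.10: Density function and density attractors

The density function is defined as the sum of the influence functions of all data points. Given N data objects described by a set of feature vectors $D = \{x_1, \ldots, x_N\} \subset F^d$, the density function is defined as follows:

$$f_B^D(x) = \sum_{i=1}^{N} f_B^{x_i}(x) \qquad (3.29)$$

For example, the density function that results from the Gaussian influence function (3.28) is:

$$f_{Gauss}^D(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}} \qquad (3.30)$$

From the density function, one can define the gradient of the function and the density attractor (a density attractor is a local maximum of the overall density function). For a continuous and differentiable influence function, a hill-climbing algorithm guided by the gradient can be used to determine the density attractor of a set of data points.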

Based on these notions, both center-defined clusters and arbitrary-shape clusters can be formally defined. A center-defined cluster is a subset C that is density-extracted, with a density function no less than a threshold ξ; otherwise (that is, if the density function is less than the threshold ξ) it is an outlier. An arbitrary-shape cluster is a set of subsets C, each of which is density-extracted, with a density function no less than a threshold ξ, and for which there exists a path P from each region to the others such that the density function at every point along the path is no less than ξ.

DENCLUE has the following major advantages compared with other clustering algorithms: (1) it has a solid mathematical foundation and generalizes other clustering methods, including partitioning-based, hierarchical, and locality-based methods; (2) it has good clustering properties for data sets with large amounts of noise; (3) it allows a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets; (4) it uses grid cells, yet keeps information only about the grid cells that actually contain data points, and manages these cells in a tree-based access structure, so it is significantly faster than some influential algorithms, for example up to 45 times faster than DBSCAN. However, the method requires careful selection of its parameters, the density parameter σ and the noise threshold ξ, because the choice of these parameters can significantly influence the quality of the clustering results.

Figure 3.11: Center-defined clusters and arbitrary-shape clusters

3.7 Grid-based clustering methods

A grid-based approach uses a multiresolution grid data structure. It first quantizes the space into a finite number of cells that form a grid structure, and then performs all operations on the grid structure. The main advantage of this approach is its fast processing time, which is typically independent of the number of data objects and depends only on the number of cells in each dimension of the quantized space.

Typical examples of the grid-based approach include STING, which explores statistical information stored in the grid cells; WaveCluster, which clusters objects using a wavelet transform method; and CLIQUE, which represents a grid-based and density-based approach to clustering in high-dimensional data space.

3.7.1 STING: A statistical information grid approach

STING (STatistical INformation Grid) (Wang, Yang and Muntz 1997) is a grid-based multiresolution approach. In this approach, the spatial area is divided into rectangular cells. There are usually several levels of such rectangular cells, corresponding to different levels of resolution, and these cells form a hierarchical structure: each cell at a high level is partitioned to form a number of cells at the next lower level. Moreover, important pieces of statistical information such as the mean, max, min, count, and standard deviation associated with the attribute values in each grid cell are precomputed and stored before a query is submitted to the system.

Figure 3.12 shows a hierarchical structure for STING clustering.

Figure 3.12: A hierarchical structure for STING clustering

The set of statistics-based parameters includes the attribute-independent parameter n (count) and the attribute-dependent parameters m (mean), s (standard deviation), min (minimum), max (maximum), and the type of distribution that the attribute values in the cell follow, such as normal, uniform, exponential, or none (if the distribution is unknown). When the data are loaded into the database, the parameters n, m, s, min, and max of the bottom-level cells are computed directly from the data. The value of the distribution can either be assigned by the user, if the distribution type is known beforehand, or be obtained by hypothesis tests such as the χ² test. The parameters of the higher-level cells can easily be computed from the parameters of the lower-level cells. The distribution type of a higher-level cell can be computed based on the majority of the distribution types of its corresponding lower-level cells, in conjunction with a threshold filtering process. If the distributions of the lower-level cells disagree with each other and fail the threshold test, the distribution type of the higher-level cell is set to "none".
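How the parameters of a higher-level cell follow from those of its lower-level cells can be sketched as below. This is a hedged illustration: the dictionary layout and function names are hypothetical, s is assumed to be the population standard deviation, and empty child cells are assumed to have been filtered out beforehand.

```python
import math

def cell_stats(values):
    """Bottom-level parameters n, m, s, min, max computed directly from data."""
    n = len(values)
    m = sum(values) / n
    s = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return {"n": n, "m": m, "s": s, "min": min(values), "max": max(values)}

def merge_cells(children):
    """Parameters of a parent cell from its non-empty child cells."""
    n = sum(c["n"] for c in children)
    m = sum(c["m"] * c["n"] for c in children) / n
    # E[x^2] of each child equals s^2 + m^2, weighted here by the child's count.
    mean_sq = sum((c["s"] ** 2 + c["m"] ** 2) * c["n"] for c in children) / n
    s = math.sqrt(max(0.0, mean_sq - m * m))
    return {"n": n, "m": m, "s": s,
            "min": min(c["min"] for c in children),
            "max": max(c["max"] for c in children)}
```

The parent statistics match what a direct pass over the underlying values would give, so the upper levels of the hierarchy can be filled in without touching the raw data again.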

The collected statistical information is very useful when answering queries. A top-down, statistical-information-grid-based query answering method can be outlined as follows. First, a layer of the hierarchy is chosen from which to start; it usually contains a small number of cells. For each cell in the current layer, we compute the confidence interval (or estimated range) of the probability that the cell is relevant to the query. Irrelevant cells are removed from further consideration, and processing of the next deeper level examines only the relevant cells. This process is repeated until the bottom layer is reached. At that point, if the query specification is met, the regions of relevant cells that satisfy the query are returned; otherwise, the data that fall into the relevant cells are retrieved and processed further, and the results that meet the requirements of the query are returned.

This approach offers several advantages over other clustering methods: (1) the grid-based computation is query-independent, because the statistical information stored in each cell represents summary information about the data in the grid cell, independent of the query; (2) the grid structure facilitates parallel processing and incremental updating; (3) the main advantage of the method is its efficiency: STING goes through the data once to compute the statistical parameters of the cells, so the time complexity of generating the clusters is O(N), where N is the total number of objects. After the hierarchical structure has been generated, the query processing time is O(G), where G is the total number of grid cells at the lowest level, which is usually much smaller than N, the total number of objects.

However, because STING uses a multiresolution approach to perform cluster analysis, the quality of STING clustering depends on the granularity of the lowest level of the grid structure. If the granularity is very fine, the cost of processing will increase substantially; if the bottom level of the grid structure is too coarse, however, the quality (fineness) of the clustering may suffer. Moreover, STING does not consider the spatial relationship between the child cells and their neighboring cells when constructing a parent cell. As a result, the shapes of the resulting clusters are isothetic: all cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected. This may lower the quality and accuracy of the clusters, in exchange for the fast processing time.

3.7.2 WaveCluster: Clustering using the wavelet transform

WaveCluster (Sheikholeslami, Chatterjee and Zhang 1998) is a multiresolution clustering approach that first summarizes the data by imposing a multiresolution grid structure on the data space, then transforms the original feature space using a wavelet transform and finds dense regions in the transformed space.

In this approach, each grid cell summarizes the information of a group of points, and this summary information fits into main memory for use by the multiresolution wavelet transform and the subsequent cluster analysis. In the grid structure, the numerical attributes of a spatial object can be represented by a feature vector, where each element of the vector corresponds to one numerical attribute, or


unit;
2) Apply the wavelet transform to the feature space;
3) Find the connected components (clusters) in the subbands of the transformed feature space at different levels;
4) Assign labels to the units;
5) Make the lookup tables and map the objects to the clusters.

Figure 3.13: Wavelet-based clustering algorithm

The computational complexity of this algorithm is O(N), where N is the number of objects in the database.

Figure 3.14: A sample 2-dimensional feature space

Example: Figure 3.14 (taken from Sheikholeslami, Chatterjee and Zhang (1998)) shows a sample 2-dimensional feature space, in which each point in the image represents the feature values of one object in the spatial data set. Figure 3.15 (taken from Sheikholeslami, Chatterjee and Zhang (1998)) shows the result of wavelet transforms at different scales, from fine (scale 1) to coarse (scale 3). At each level, subband LL (average) is shown in the upper-left quadrant, subband LH (horizontal edges) in the upper-right quadrant, subband HL (vertical edges) in the lower-left quadrant, and subband HH (corners) in the lower-right quadrant.

WaveCluster is a grid-based and density-based algorithm. WaveCluster meets all the requirements of a good clustering algorithm: it handles large data sets efficiently, discovers clusters of arbitrary shape, handles outliers successfully, and is insensitive to the order of the input. Compared with BIRCH, CLARANS and DBSCAN, WaveCluster outperforms these methods in both efficiency and clustering quality.

Figure 3.15: Multiresolution of the feature space in Figure 3.14. a) scale 1; b) scale 2; c) scale 3

3.7.3 CLIQUE: Clustering high-dimensional space

    Mt gii thut phn cm khc, CLIQUE, Agrawal et al. (1998), tch h p ph ng php phn cm da trn l i v mt theo mt cch khc. N r t huch cho phn cm dliu v i schiu cao trong cc c s d liu l n.

    Cho tr c mt t p l n cc im d liu a chiu, cc im d liu nyth ng nm khngng nht trong khng gian d liu. Phn cm d liu nhn bit cc v tr tha th t hayngc, do vy tm ra ton bcc mu phn bcat p dliu.

    Mt unit l dyc nu nhphn nhca ccim d liu cha trong unitv t qu mt tham sm hnhu vo. Mt cm l mt t p l n nht cc unitdyc c k t ni.

    CLIQUE partitions the m-dimensional data space into non-overlapping rectangular units, identifies the dense units, and finds clusters in all subspaces of the original data space, using a candidate generation method similar to that of the Apriori algorithm for mining association rules.
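The following sketch shows how dense units might be found in 1-D subspaces and then in 2-D subspaces with Apriori-style candidate generation: a 2-D unit is only counted if both of its 1-D projections are dense. It is a simplified illustration under stated assumptions, not the CLIQUE implementation: the function name `dense_units`, the equal-width partition into `xi` intervals over a common range `[lo, hi]` and the density threshold `tau` are all hypothetical choices.

```python
from collections import Counter
from itertools import combinations

def dense_units(points, xi, tau, lo=0.0, hi=1.0):
    """Map each 1-D and 2-D subspace to its set of dense units.

    A subspace is a tuple of dimension indices; a unit is a tuple of
    interval indices, one per dimension of the subspace. A unit is dense
    if it holds more than a fraction tau of all points.
    """
    n = len(points)
    width = (hi - lo) / xi

    def interval(v):
        # Clamp so that v == hi falls into the last interval.
        return min(int((v - lo) / width), xi - 1)

    cells = [tuple(interval(v) for v in p) for p in points]
    result = {}

    # 1-D subspaces: count the points per interval of every dimension.
    for d in range(len(points[0])):
        counts = Counter(c[d] for c in cells)
        dense = {(i,) for i, cnt in counts.items() if cnt / n > tau}
        if dense:
            result[(d,)] = dense

    # 2-D candidates: combine only dimensions that have dense 1-D units.
    dims = sorted(key[0] for key in result)
    for d1, d2 in combinations(dims, 2):
        candidates = {(i, j) for (i,) in result[(d1,)] for (j,) in result[(d2,)]}
        counts = Counter((c[d1], c[d2]) for c in cells)
        dense = {u for u in candidates if counts[u] / n > tau}
        if dense:
            result[(d1, d2)] = dense
    return result

# Hypothetical data: dense in dimensions 0 and 1, spread out in dimension 2.
points = [(0.10, 0.10, 0.05), (0.12, 0.15, 0.30), (0.20, 0.10, 0.55),
          (0.15, 0.20, 0.80), (0.90, 0.90, 0.10), (0.80, 0.60, 0.60)]
found = dense_units(points, xi=4, tau=0.4)
```

Because dimension 2 has no dense 1-D unit, no 2-D subspace containing it is ever examined, which is the pruning idea borrowed from Apriori.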

    CLIQUE performs multidimensional clustering in two steps:

    First, CLIQUE identifies the clusters by determining the dense units in all subspaces of interest and then determining the connected dense units in all subspaces of interest.
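Grouping connected dense units into clusters can be sketched as a search for connected components within one subspace. The sketch assumes that two units are connected when they share a common face (their interval indices differ by one in exactly one dimension) or are linked through a chain of such units; the function name `connected_clusters` is hypothetical.

```python
def connected_clusters(units):
    """Group the dense units of one subspace into maximal connected sets."""
    remaining = set(units)
    clusters = []
    while remaining:
        seed = remaining.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            u = frontier.pop()
            # Visit the face neighbours of u along every axis.
            for axis in range(len(u)):
                for step in (-1, 1):
                    v = u[:axis] + (u[axis] + step,) + u[axis + 1:]
                    if v in remaining:
                        remaining.remove(v)
                        cluster.add(v)
                        frontier.append(v)
        clusters.append(cluster)
    return clusters

# Hypothetical dense units in a 2-D subspace.
clusters = connected_clusters({(0, 0), (0, 1), (1, 1), (3, 3)})
```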


    + It begins with the attribute name and a ":" character, followed by the discrete values of the attribute (if the attribute is categorical or binary) or by the attribute type (if the attribute is continuous).

    - All comments are placed after the "|" character.

    Table 4.1: An example of a data format file (*.names)

    1, 2, 3.
    1: continuous.
    2: 1, 2, 3, 4. |categorical
    3: continuous.
    4: 0, 1. |binary

    4.2.2 Data sample file

    Each sample occupies one line. The attribute values of the sample are written first and the class value last. The values are separated by ",".

    Table 4.2: An example of a data file (*.data)

    0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,1,4
    0,0,0,1,0,1,1,1,1,1,0,1,0,1,0,1,1
    0,1,1,0,1,0,0,0,1,1,0,0,2,1,1,0,2
    0,1,1,0,1,1,0,0,1,1,0,0,2,1,0,0,2
    1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
    0,1,1,0,1,0,0,0,1,1,0,0,2,1,0,1,2

    4.2.3 Data source

    Within the scope of this thesis, the data are taken from the web site:
    - ftp://ftp.ics.uci.edu/pub/

    4.3 Program design

    Given the functional blocks and the data described above, the program is designed as follows:


    Figure 4.1: Program design
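The design contains a GetNames module and a GetData module that read the two files described in Section 4.2. A minimal sketch of what such readers could look like is given below. The function names `get_names` and `get_data`, the return types, and the assumption that the first non-comment line of a *.names file lists the class names are illustrative choices, not the thesis's actual code.

```python
def get_names(text):
    """Parse *.names content into (classes, attributes).

    Each attribute is (name, "continuous", None) or (name, "discrete", values).
    Assumes the first non-comment line lists the class names.
    """
    lines = []
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()      # drop the comment part
        if line:
            lines.append(line.rstrip("."))       # drop the closing "."
    classes = [c.strip() for c in lines[0].split(",")]
    attributes = []
    for line in lines[1:]:
        name, spec = line.split(":", 1)
        spec = spec.strip()
        if spec == "continuous":
            attributes.append((name.strip(), "continuous", None))
        else:
            values = [v.strip() for v in spec.split(",")]
            attributes.append((name.strip(), "discrete", values))
    return classes, attributes

def get_data(text, n_attributes):
    """Parse *.data content into a list of (attribute_values, class_value)."""
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != n_attributes + 1:
            raise ValueError("unexpected number of fields: " + line)
        samples.append((fields[:-1], fields[-1]))  # the class value comes last
    return samples

# The *.names example of Table 4.1 and two hypothetical matching samples.
names_text = """1, 2, 3.
1: continuous.
2: 1, 2, 3, 4. |categorical
3: continuous.
4: 0, 1. |binary
"""
classes, attributes = get_names(names_text)
samples = get_data("5.1,2,3.5,0,1\n6.0,4,2.9,1,3\n", len(attributes))
```

Values are kept as strings here; converting continuous attributes to numbers would be a natural next step once the attribute types are known.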

    4.4 Experimental results and evaluation

    4.4.1 Experimental procedure

    - Cluster the data with the Kmeans and Kmedoids algorithms.
    - Label the clusters, then evaluate and compare the labeling performance of the two algorithms on the UCI data sets (using only data with continuous attributes).
    - Label the clusters and evaluate the labeling performance on data with mixed attributes.
    - Improve the class labeling performance.
    - Compare the classification quality with that of the See5 program.
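The cluster labeling step can be pictured with the sketch below. It assumes that each cluster receives the majority class of its members and that a sample counts as correctly labeled when its own class equals the label of its cluster. This section does not spell out the labeling rule, so the majority vote and the names `label_clusters` and `count_correct` are assumptions.

```python
from collections import Counter

def label_clusters(assignments, true_labels):
    """Give each cluster the most frequent class among its members."""
    votes = {}
    for cluster, label in zip(assignments, true_labels):
        votes.setdefault(cluster, Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

def count_correct(assignments, true_labels, cluster_label):
    """Count the samples whose class matches the label of their cluster."""
    return sum(1 for cluster, label in zip(assignments, true_labels)
               if cluster_label[cluster] == label)

# Hypothetical clustering of six samples into three clusters.
assignments = [0, 0, 0, 1, 1, 2]
true_labels = ["a", "a", "b", "b", "b", "a"]
cluster_label = label_clusters(assignments, true_labels)
correct = count_correct(assignments, true_labels, cluster_label)
```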

    The See5 program (version 2.03), written by Ross Quinlan, is a tool that classifies data using the decision-tree technique with the C5.0 algorithm. Its effectiveness is widely recognized, so the thesis uses it as the benchmark against which the classification results obtained here are compared. A limitation of See5 (version 2.03) is that it accepts at most 400 data samples.

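    [Figure 4.1 diagram. Boxes: data format file; Module GetNames; data sample file; Module GetData; extracted information (number and names of the classes; number, names and types or discrete values of the attributes; number of samples, with the attribute values and class name of each sample); clustering modules; classification; class labeling; improvement of class labeling; display of results; class labeling and classification results.]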


    4.4.2 Experiments

    The results obtained are given below.

    4.4.2.1 Class labeling problem: carried out with the number of clusters K = 2, 4, 6, 8, 10, 16. (Kmeans: ma; Kmedoids: md)

    Table 4.3: Experimental results for class labeling (number of correctly labeled samples)

    Data set        Samples   K=2         K=4         K=6         K=8         K=10        K=16
                              ma    md    ma    md    ma    md    ma    md    ma    md    ma    md
    Breast cancer   500       328   480   485   481   484   481   481   482   481   482   481   482
    Haberman        306       225   225   229   226   230   231   228   232   233   234   237   240
    Iris            150       100   51    126   53    126   55    125   57    121   59    142   65
    Pima            768       532   504   539   537   541   528   525   554   554   558   558   561
    Glass           214       78    82    105   84    117   86    117   88    125   90    140   96
    Wine            178       107   72    173   80    169   85    168   84    166   87    173   93
    Balance         625       293   369   407   423   448   441   503   451   438   453   483   459
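The counts in Table 4.3 can be turned into percentages of correctly labeled samples, which is the quantity plotted for K=10. The short sketch below does this for the K=10 columns; the dictionary layout and the name `percent_correct` are illustrative only.

```python
# K=10 columns of Table 4.3: data set -> (samples, Kmeans correct, Kmedoids correct)
K10 = {
    "Breast cancer": (500, 481, 482),
    "Haberman": (306, 233, 234),
    "Iris": (150, 121, 59),
    "Pima": (768, 554, 558),
    "Glass": (214, 125, 90),
    "Wine": (178, 166, 87),
    "Balance": (625, 438, 453),
}

def percent_correct(table):
    """Return data set -> (Kmeans %, Kmedoids %) of correctly labeled samples."""
    return {name: (100.0 * ma / n, 100.0 * md / n)
            for name, (n, ma, md) in table.items()}

percentages = percent_correct(K10)
for name, (ma, md) in percentages.items():
    print(f"{name:14s} Kmeans {ma:5.1f}%   Kmedoids {md:5.1f}%")
```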

    [Chart: Comparison of Kmeans and Kmedoids. Horizontal axis: data sets (breast cancer, haberman, iris, pima, glass, wine, balance). Vertical axis: percentage of correctly labeled samples (0 to 120). Series: Kmeans, Kmedoids.]

    Figure 4.2: Comparison of Kmeans and Kmedoids in the class labeling problem with K=10

    The chart above shows that, for continuous data, the labeling ability of Kmedoids on the UCI data sets is usually lower than that of Kmeans. The reason is that the representative point in Kmedoids is an actual object located near the cluster center, whereas the cluster center in Kmeans is the mean of the elements of the cluster. If the data contain little noise, Kmeans gives better results than Kmedoids. In the opposite case, a noise point with an extremely large value essentially distorts the data distribution when Kmeans is used, and Kmedoids is then the more effective choice. The comparison chart above suggests that the data contain little noise. However, the measure of similarity between objects used in Kmedoids does not yet appear to be very effective, so the percentage of correctly labeled samples is not yet high. To improve the labeling accuracy, the thesis proposes the following method:

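    Each sample that is wrongly labeled in its cluster is moved to a suitable cluster (call it cluster A) if the following conditions hold:

    + The distance from the sample to its current cluster equals its distance to cluster A.
    + The class label of cluster A is the same as the class label of the sample.
    + If the sample is added to cluster A, the center of the cluster does not change (or changes by no more than a small distance epsilon given in advance).

    The experiments show that the labeling accuracy increases. The results for several data sets are given below (Old: before the improvement; New: after the improvement).

    Table 4.4: Results of improving the class labeling quality

    K       Result   Iris   Wine   Balance   Haberman
    K=4     Old      53     80     423       226
            New      54     89     447       226
    K=6     Old      55     85     441       231
            New      57     85     459       231
    K=8     Old      57     84     451       232
            New      61     86     475       233
    K=10    Old      59     87     453       234
            New      65     90     477       237
    K=16    Old      65     93     459       240
            New      77     102    483       249
    K=20    Old      69     97     463       244
            New      85     110    487       257
    K=25    Old      74     102    468       249
            New      95     120    492       267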
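The three reassignment conditions can be sketched as a single check per wrongly labeled sample. The sketch assumes Euclidean distance to a mean-based cluster center, a small tolerance `tol` for testing that two distances are equal, and the hypothetical names `find_target_cluster`, `centers`, `members` and `cluster_label`; the thesis does not state how these details were implemented.

```python
import numpy as np

def find_target_cluster(x, y, current, centers, members, cluster_label,
                        eps=1e-3, tol=1e-9):
    """Return the id of a cluster A that sample x (class y) may move to, or None.

    centers: id -> center vector; members: id -> number of samples in the cluster;
    cluster_label: id -> class label assigned to the cluster.
    """
    d_current = np.linalg.norm(x - centers[current])
    for a, center in centers.items():
        if a == current or cluster_label[a] != y:
            continue                                  # condition 2: labels must match
        if abs(np.linalg.norm(x - center) - d_current) > tol:
            continue                                  # condition 1: equal distances
        n = members[a]
        new_center = (center * n + x) / (n + 1)       # mean after adding x
        if np.linalg.norm(new_center - center) <= eps:
            return a                                  # condition 3: center barely moves
    return None

# Hypothetical example: x is equidistant from both centers and has class "b".
centers = {0: np.array([0.0, 0.0]), 1: np.array([2.0, 0.0])}
members = {0: 9, 1: 9}
cluster_label = {0: "a", 1: "b"}
x = np.array([1.0, 0.0])
moved = find_target_cluster(x, "b", 0, centers, members, cluster_label, eps=0.2)
kept = find_target_cluster(x, "b", 0, centers, members, cluster_label, eps=0.05)
```

With nine members in cluster 1, adding `x` shifts its center by about 0.1, so the move is accepted for `eps=0.2` and rejected for `eps=0.05`.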


    4.4.2.2 Classification problem

    Table 4.5: Experimental classification results of Kmeans (ma) and Kmedoids (md)

    Data set        Samples   Correctly classified   Correct rate (%)
                              ma      md             ma         md
    Breast cancer   500       318     480            63.6       96
    Haberman        306       179     115            58.4967    50.6536
    Iris            150       125     52             83.3333    34.6667
    Pima            768       532     504            69.2708    65.625
    Glass           214       93      72             43.4579    33.6449
    Soybean         47        32      22             68.0851    46.8085
    Wine            178       172     70             96.6292    39.3258
    Balance         625       313     336            50.08      53.76

    [Chart: Comparison of Kmeans and Kmedoids. Horizontal axis: data sets (breast cancer, haberman, iris, pima, glass, soybean, wine, balance). Vertical axis: percentage of correctly classified samples (0 to 120). Series: Kmeans, Kmedoids.]

    Figure 4.3: Comparison of Kmeans and Kmedoids in the classification problem

    Table 4.6: Experimental classification results of Kmedoids (md) and See5

    Data set        Samples   Correctly classified   Correct rate (%)
                              See5    md             See5        md
    Breast cancer   400       391     344            97.75       86
    Haberman        306       236     115            77.12418    50.6536
    Iris            150       106     52             83.3333     34.6667
    Pima            400       307     262            76.75       65.5
    Car             298       289     202            72.25       67.7852
    Balance         400       336     238            84          64.8501

    [Chart: Comparison of Kmedoids and See5. Horizontal axis: data sets (Breast cancer, Haberman, Iris, Pima, Car, Balance). Vertical axis: percentage of correctly classified samples (0 to 120). Series: Kmedoids, See5.]

    Figure 4.4: Comparison of Kmedoids and See5 in the classification problem

    The chart above shows that See5 classifies better. It builds a tree-structured classification model that is genuinely effective, and the model limits the branches that reflect noise, so its classification quality is high. Kmedoids can handle data of mixed types, but the quality of its object similarity computation is not yet high, so it classifies less well than See5.

    4.5 Conclusion

    Thus, after experiments on a number of UCI data sets, we observe that Kmeans gives better class labeling and classification results than Kmedoids on data with continuous attributes. Kmeans cannot handle data with mixed attributes. Kmedoids, using the method for computing the similarity between two samples proposed by Ducker (1965) and improved by Kaufman and Rousseeuw (1990), handles such data with above-average accuracy and a computational cost of O(k(n-k)^2).


    For large values of n and k this cost becomes high. Improving the accuracy and the computation speed is therefore a direction for future development.
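To make the idea of a similarity measure for mixed attributes concrete, the sketch below computes a dissimilarity between two samples by averaging per-attribute contributions: a range-normalized absolute difference for continuous attributes and a simple mismatch indicator for discrete ones. This is a generic illustration of the approach, not the exact formula used in the thesis; the name `mixed_dissimilarity` and its parameters are hypothetical.

```python
def mixed_dissimilarity(a, b, kinds, ranges):
    """Average per-attribute dissimilarity between samples a and b, in [0, 1].

    kinds[i] is "continuous" or "discrete"; ranges[i] is the value range
    (max - min) of a continuous attribute and is ignored for discrete ones.
    """
    total = 0.0
    for x, y, kind, value_range in zip(a, b, kinds, ranges):
        if kind == "continuous":
            # Range-normalized absolute difference; a zero range contributes 0.
            total += abs(x - y) / value_range if value_range else 0.0
        else:
            # Discrete attributes contribute 0 on a match and 1 on a mismatch.
            total += 0.0 if x == y else 1.0
    return total / len(kinds)

# Hypothetical samples with one continuous and two discrete attributes.
kinds = ("continuous", "discrete", "discrete")
ranges = (4.0, None, None)
d = mixed_dissimilarity((5.0, "red", 1), (7.0, "blue", 1), kinds, ranges)
```

In a Kmedoids run such a function would be evaluated for many pairs of samples, which is one reason the overall cost grows quickly with n and k.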


    The accuracy of class labeling and classification depends on many factors, such as the quality of the data, the algorithm implemented and the method used to compute the similarity between data objects. In addition, missing values and redundant attributes also affect the accuracy to some extent. Future work will therefore address the handling of missing values, the detection and removal of redundant attributes and the improvement of the similarity computation, in order to raise the quality and speed of class labeling and classification.

    Implement and continue to study further data mining techniques, in particular applying them to concrete real-world problems.


    REFERENCES

    1. Anil K. Jain and Richard C. Dubes (1988), Algorithms for Clustering Data, Prentice-Hall, Inc., USA.
    2. Ho Tu Bao (1998), Introduction to Knowledge Discovery and Data Mining.
    3. Jiawei Han and Micheline Kamber (2000), Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.
    4. Joydeep Ghosh (2003), Scalable Clustering, Chapter 10, pp. 247-278. Formal version appears in: The Handbook of Data Mining, Nong Ye (Ed).
    5. J. Ross Quinlan (1993), C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers.
    6. Mercer (2003), Clustering large datasets, Linacre College.
    7. Pavel Berkhin, Survey of Clustering Data Mining Techniques, Accrue Software, Inc., San Jose.