hosyky
8/3/2019 TM HIU GOM CM D LIU V H
A STUDY OF DATA CLUSTERING
AND THE K-MEANS ALGORITHM
DATA CLUSTERING
Data clustering is a task in data mining.
Clustering lets us organize data so that it is no longer fragmented.
For a large, fragmented database, clustering is essential and practically indispensable.
PURPOSE OF CLUSTERING
The purpose of data clustering is to discover the structure of the data, forming data sets (clusters) out of large groups of data.
REQUIREMENTS OF DATA CLUSTERING
Data clustering makes the data within one cluster similar to each other, while elements of different clusters are dissimilar.
The similarity measure is defined by the user. It is determined from the attributes that describe the objects. Usually we measure the distance between objects.
REQUIREMENTS OF DATA CLUSTERING
Ability to cluster incrementally, independent of the order of the input data
Ability to handle high-dimensional data
Ability to cluster under constraints
Interpretability and usability
CLASSIFICATION OF CLUSTERING METHODS
Partitioning: partitions are created and evaluated according to some criterion.
Hierarchical: the data set/objects are decomposed into a hierarchy according to some criterion.
Density-based: based on connectivity and density functions.
Grid-based: based on a multiple-level granularity structure.
Model-based: a hypothesized model is proposed for each cluster; its parameters are then adjusted so that the model best fits the data/objects of that cluster.
METHODS FOR EVALUATING DATA CLUSTERING
External validation
  Evaluates the clustering result against a structure specified in advance for the data set
  Measures: Rand statistic, Jaccard coefficient, Fowlkes and Mallows index
Internal validation
  Evaluates the clustering result using quantities computed from the vectors of the data set itself (the proximity matrix)
  Measures: Hubert's statistic, Silhouette index, Dunn's index,
Relative validation
  Evaluates the clustering result by comparing it with other clustering results obtained under different parameter settings
Criteria for evaluating and selecting the optimal clustering result
- Compactness: the objects within a cluster should be close to each other.
- Separation: the clusters should be far from each other.
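The external measures named above compare a clustering with a reference partition by counting pairs of objects. The following is a minimal sketch of the Rand statistic and the Jaccard coefficient under their usual pair-counting definitions; the function names and the label-list representation are assumptions made for illustration, not part of the slides.

```python
from itertools import combinations


def pair_counts(labels_true, labels_pred):
    """Classify every pair of objects by whether the two partitions agree on it."""
    ss = sd = ds = dd = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            ss += 1      # together in both partitions
        elif same_true:
            sd += 1      # together in the reference only
        elif same_pred:
            ds += 1      # together in the clustering only
        else:
            dd += 1      # separated in both partitions
    return ss, sd, ds, dd


def rand_statistic(labels_true, labels_pred):
    """Fraction of pairs on which the two partitions agree."""
    ss, sd, ds, dd = pair_counts(labels_true, labels_pred)
    return (ss + dd) / (ss + sd + ds + dd)


def jaccard_coefficient(labels_true, labels_pred):
    """Like the Rand statistic, but ignores pairs separated in both partitions."""
    ss, sd, ds, _ = pair_counts(labels_true, labels_pred)
    if ss + sd + ds == 0:
        raise ValueError("undefined when no pair is grouped in either partition")
    return ss / (ss + sd + ds)
```

Both values lie in [0, 1], and larger values mean closer agreement with the reference structure.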
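METHODS FOR EVALUATING DATA CLUSTERING
Evaluation by entropy (the value is small when the clustering quality is good)
$$Entropy(I) = \sum_i p_i \Big( -\sum_j p_{ij} \log p_{ij} \Big) = -\sum_i \frac{n_i}{n} \sum_j \frac{n_{ij}}{n_i} \log \frac{n_{ij}}{n_i}$$
with $p_i = n_i / n$ and $p_{ij} = n_{ij} / n_i$.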
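As an illustration of the entropy criterion, here is a minimal sketch that assumes the usual size-weighted cluster entropy: n_ij is taken to be the number of objects of reference class j in cluster i, n_i the size of cluster i, and n the total number of objects. These readings of the symbols, the natural logarithm, and the function name are assumptions, since the slide gives only the formula.

```python
from collections import Counter
from math import log


def clustering_entropy(cluster_labels, class_labels):
    """Size-weighted average of the class entropy inside each cluster."""
    n = len(cluster_labels)
    counts_by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        counts_by_cluster.setdefault(cluster, Counter())[cls] += 1

    total = 0.0
    for counts in counts_by_cluster.values():
        n_i = sum(counts.values())
        # Entropy of the class distribution within this cluster.
        h_i = -sum((n_ij / n_i) * log(n_ij / n_i) for n_ij in counts.values())
        total += (n_i / n) * h_i
    return total
```

A clustering whose clusters each contain a single class scores 0; a cluster that mixes two classes evenly contributes log 2.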
ISSUES TO ADDRESS
Representing data types
+ We only consider the data types that are needed for clustering.
+ We define d(i,j) as the distance between two objects i and j:
d(i,j) ≥ 0
d(i,i) = 0
d(i,j) = d(j,i)
d(i,j) ≤ d(i,k) + d(k,j), where k is any point other than i and j.
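The four distance properties can be checked mechanically on a small sample of points. The sketch below is illustrative only: the helper name `satisfies_distance_axioms` and the tolerance are assumptions. It shows that Manhattan distance passes while squared Euclidean distance fails the triangle inequality.

```python
from itertools import product


def satisfies_distance_axioms(points, d, tol=1e-9):
    """Check non-negativity, d(i,i) = 0, symmetry and the triangle inequality."""
    for i in points:
        if abs(d(i, i)) > tol:
            return False
    for i, j in product(points, repeat=2):
        if d(i, j) < -tol or abs(d(i, j) - d(j, i)) > tol:
            return False
    for i, j, k in product(points, repeat=3):
        if d(i, j) > d(i, k) + d(k, j) + tol:
            return False
    return True


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


sample = [(0, 0), (1, 0), (2, 0)]
manhattan_ok = satisfies_distance_axioms(sample, manhattan)
squared_ok = satisfies_distance_axioms(sample, squared_euclidean)
```

On this sample the squared distance from (0, 0) to (2, 0) is 4, which exceeds 1 + 1 through the midpoint, so it is not a distance in the sense defined above.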
ISSUES TO ADDRESS
Objects i and j are represented by the vectors x and y. The similarity between i and j is computed by the formula
$x = (x_1, \dots, x_p)$
$y = (y_1, \dots, y_p)$
$s(x, y) = \dfrac{x_1 y_1 + \dots + x_p y_p}{\sqrt{x_1^2 + \dots + x_p^2}\,\sqrt{y_1^2 + \dots + y_p^2}}$
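The formula for s(x, y) is the cosine of the angle between the two vectors. A minimal sketch follows; the function name and the decision to raise an error for a zero vector are assumptions, since the slide does not say how that case is handled.

```python
from math import sqrt


def cosine_similarity(x, y):
    """s(x, y) = (x . y) / (|x| * |y|) for two vectors of equal length."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

Parallel vectors such as (1, 2, 3) and (2, 4, 6) score 1 (up to rounding), and orthogonal vectors score 0.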
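ISSUES TO ADDRESS
Interval-scaled variables/attributes
+ Deviation
+ Distance
+ Z-score measurement
Mean absolute deviation: $s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$
Mean: $m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$
Z-score: $z_{if} = \dfrac{x_{if} - m_f}{s_f}$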
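The standardization of one interval-scaled attribute f can be sketched as follows, using the mean m_f and the mean absolute deviation s_f (not the standard deviation). The function name and the error raised for a constant attribute are assumptions.

```python
def standardize(values):
    """Return z-scores z_if = (x_if - m_f) / s_f for one attribute f."""
    n = len(values)
    m_f = sum(values) / n                          # mean of the attribute
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    if s_f == 0:
        raise ValueError("attribute is constant; z-score is undefined")
    return [(x - m_f) / s_f for x in values]
```

For the values 2, 4, 6, 8 the mean is 5 and the mean absolute deviation is 2, so the z-scores are -1.5, -0.5, 0.5, 1.5.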
ISSUES TO ADDRESS
Formulas for distance measures
+ Minkowski distance
+ Manhattan distance
$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$
+ Euclidean distance
$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$
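The three distance measures are closely related: under the standard definition, the Minkowski distance of order q is (|x_i1 - x_j1|^q + ... + |x_ip - x_jp|^q)^(1/q), and it reduces to the Manhattan distance for q = 1 and the Euclidean distance for q = 2. A minimal sketch, with the parameter name `q` chosen for illustration:

```python
def minkowski(x, y, q):
    """Minkowski distance of order q >= 1 between two equal-length vectors."""
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)


def manhattan(x, y):
    return minkowski(x, y, 1)


def euclidean(x, y):
    return minkowski(x, y, 2)
```

For the points (0, 0) and (3, 4), the Manhattan distance is 7 and the Euclidean distance is 5.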
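ISSUES TO ADDRESS
Binary variables/attributes
                  Obj j
                  1      0      sum
    Obj i   1     a      b      a+b
            0     c      d      c+d
            sum   a+c    b+d    p
Simple matching coefficient (if the variables are symmetric):
$d(i,j) = \dfrac{b + c}{a + b + c + d}$
Jaccard coefficient (if the variables are asymmetric):
$d(i,j) = \dfrac{b + c}{a + b + c}$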
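The two binary dissimilarities can be computed directly from the counts a, b, c, d of the contingency table (a: both objects are 1; b: i is 1 and j is 0; c: i is 0 and j is 1; d: both are 0). The sketch below assumes objects are given as equal-length lists of 0/1 values; the function names are illustrative.

```python
def contingency(x, y):
    """Return (a, b, c, d) for two binary vectors of equal length."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d


def simple_matching_distance(x, y):
    """Symmetric case: 1/1 and 0/0 matches both count as agreement."""
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)


def jaccard_distance(x, y):
    """Asymmetric case: 0/0 matches are ignored."""
    a, b, c, _ = contingency(x, y)
    if a + b + c == 0:
        raise ValueError("undefined when both objects are all zeros")
    return (b + c) / (a + b + c)
```

For x = [1, 0, 1, 0, 0, 0] and y = [1, 1, 0, 0, 0, 0] the counts are a = 1, b = 1, c = 1, d = 3, giving 2/6 for simple matching and 2/3 for Jaccard.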
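ISSUES TO ADDRESS
Variables/attributes of mixed types
$d(i,j) = \dfrac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$
If $x_{if}$ or $x_{jf}$ is missing, then $\delta_{ij}^{(f)} = 0$.
f (variable/attribute): binary (nominal)
$d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, or $d_{ij}^{(f)} = 1$ otherwise
f: interval-scaled (Minkowski, Manhattan, Euclidean)
f: ordinal or ratio-scaled
compute the ranks $r_{if}$ and $z_{if} = \dfrac{r_{if} - 1}{M_f - 1}$; $z_{if}$ is then treated as interval-scaled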
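The mixed-type formula averages per-attribute dissimilarities over the attributes that are present in both objects. The following is a sketch under stated assumptions: missing values are represented by `None`, interval attributes are assumed to be already scaled to [0, 1] so that an absolute difference can serve as the per-attribute distance, and the names `kinds` and `levels` (mapping an ordinal attribute's index to its number of ranks M_f) are hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, levels=None):
    """d(i,j) = sum(delta_f * d_f) / sum(delta_f) over attributes f."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue                      # delta = 0: attribute is skipped
        if kind == "nominal":
            d_f = 0.0 if xf == yf else 1.0
        elif kind == "interval":
            d_f = abs(xf - yf)            # assumes values pre-scaled to [0, 1]
        elif kind == "ordinal":
            m_f = levels[f]               # number of ranks for attribute f
            z_x = (xf - 1) / (m_f - 1)
            z_y = (yf - 1) / (m_f - 1)
            d_f = abs(z_x - z_y)
        else:
            raise ValueError(f"unknown attribute kind: {kind}")
        numerator += d_f                  # delta = 1 for this attribute
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no attribute is present in both objects")
    return numerator / denominator
```

For example, with kinds ("nominal", "interval", "ordinal") and three ranks for the ordinal attribute, the objects ("red", 0.2, 1) and ("blue", 0.5, 3) have per-attribute dissimilarities 1, 0.3 and 1, so d(i,j) is about 0.767.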
SIGNIFICANCE OF CLUSTERING
Clustering lets us analyze and study each cluster of data in depth, in order to discover hidden information that supports decision making.
DATA CLUSTERING ALGORITHMS
There are many data clustering algorithms; typical examples are the k-means algorithm and the hierarchical clustering algorithm.
We will study the K-Means algorithm for data clustering.
THE K-MEANS ALGORITHM
INPUT: A database of n objects and the number of clusters k.
OUTPUT: The clusters Ci (i = 1, ..., k) such that the criterion function E reaches its minimum value.
Step 1: Initialization
Choose k objects mj (j = 1...k) from the data set as the initial centroids of the k clusters (the choice may be random or based on experience).
Step 2: Compute distances
For each object Xi (1
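The slide text stops partway through Step 2, so the sketch below fills in the rest with the standard k-means formulation as an assumption: each object is assigned to its nearest centroid, each centroid is recomputed as the mean of its members, and the loop ends when assignments stop changing. The criterion E is assumed to be the sum of squared distances from objects to their centroids. Function and parameter names are illustrative.

```python
import random


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, assignment, E) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k objects from the data set as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break                          # no object changed cluster
        assignment = new_assignment
        # Assumed update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                    # keep the old centroid if empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Assumed criterion E: sum of squared distances to assigned centroids.
    error = sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, assignment, error


data = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, assignment, error = kmeans(data, k=2)
```

On this toy data set the two well-separated pairs end up in different clusters regardless of which two objects are drawn as initial centroids.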
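THE K-MEANS ALGORITHM
The complexity of the algorithm is O(n.k.d.t.T), where:
n is the number of data objects
k is the number of clusters
d is the number of dimensions
t is the number of iterations
T is the time needed for one basic operation such as addition, subtraction, multiplication, or division.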
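To get a feel for the bound, here is a worked example with purely illustrative values (none of them come from the slides): n = 10^6 objects, k = 10 clusters, d = 20 dimensions, t = 50 iterations, and T = 10^-9 seconds per basic operation.

```latex
n \cdot k \cdot d \cdot t = 10^{6} \cdot 10 \cdot 20 \cdot 50 = 10^{10} \text{ basic operations},
\qquad
10^{10} \cdot T = 10^{10} \cdot 10^{-9}\,\text{s} = 10\,\text{s}
```

The cost grows linearly in each factor, so doubling the number of objects or the number of clusters doubles the estimate.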
THE K-MEANS ALGORITHM
Advantages: K-Means is a simple cluster analysis method, so it can be applied to large data sets.
Disadvantages: K-Means applies only to data with numeric attributes and discovers only spherical clusters; it is also very sensitive to noise and outliers in the data. In addition, it depends heavily on its input parameters.
THE K-MEANS ALGORITHM
When the initial centroids lie far from the natural cluster centroids, the quality of the k-means result is very low; that is, the discovered clusters deviate greatly from the clusters that actually exist. In practice there is no optimal method for choosing the input parameters; the most common approach is to experiment with different values of the input k and then choose the best solution.
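The trial-and-error approach described above can be sketched as a loop over candidate values of k, with several random restarts per value to reduce the effect of a poor initialization. This is an illustrative sketch, not a prescribed procedure: the compact k-means, the sum-of-squared-distances criterion E, and all names are assumptions. Note that E normally decreases as k grows, so the values still have to be compared with judgment rather than by simply taking the minimum over k.

```python
import random


def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def kmeans_error(points, k, rng, max_iter=100):
    """Run one k-means trial and return its final criterion value E."""
    centroids = rng.sample(points, k)      # random initial centroids
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, a in zip(points, labels) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, labels))


def best_error_per_k(points, k_values, restarts=10, seed=0):
    """For each k, keep the lowest E found over several random restarts."""
    rng = random.Random(seed)
    return {
        k: min(kmeans_error(points, k, rng) for _ in range(restarts))
        for k in k_values
    }


data = [(0, 0), (0, 1), (10, 10), (10, 11)]
errors = best_error_per_k(data, k_values=[1, 2])
```

On this toy data the error drops sharply from k = 1 to k = 2, which matches the two visible groups.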
THE K-MEANS ALGORITHM
To date, many algorithms that inherit the idea of k-means have been applied in data mining to handle very large data sets, and they are used effectively and widely, such as k-medoid, PAM, CLARA, CLARANS, k-prototypes,