Data Mining (Khai Phá Dữ Liệu)
Nguyễn Nhật Quang
quangnn-fit@mail.hut.edu.vn
Hanoi University of Science and Technology, School of Information and Communication Technology
Academic year 2011-2012

Course contents:
- Introduction to data mining
- Introduction to the WEKA tool
- Data preprocessing
- Association rule discovery
- Classification and prediction techniques
- Clustering techniques

Datasets
- A dataset is a collection of objects and their attributes
- Attributes
  - An attribute describes one characteristic of an object
  - Example: the attributes Refund, Marital Status, Taxable Income, Cheat
- The set of attribute values describes an object
- Objects
  - An object is also referred to as a record, data point, case, sample, entity, or instance

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
(Tan, Steinbach, Kumar - Introduction to Data Mining)

Types of datasets
- Record
  - Records in a relational database
  - Data matrix
  - Document representation
  - Transaction data
- Graph
  - World Wide Web
  - Information networks or social networks
  - Molecular structures
- Ordered
  - Spatial data (e.g., maps)
  - Temporal data (e.g., time-series data)
  - Sequence data (e.g., transaction sequences)
  - Genetic sequence data

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
(Han, Kamber - Data Mining: Concepts and Techniques)

Attribute value types
- Nominal (names/strings): unordered
  - Takes values from an unordered set of values (labels)
  - Example: attributes such as Name, Profession, ...
- Binary: a special case of the nominal type
  - The value set contains only 2 values (Y/N, 0/1, T/F)
- Ordinal
  - Takes values from an ordered set of values
  - Example 1: attributes with numeric values, such as Age, Height, ...
  - Example 2: the attribute Income, taking values from the set {low, medium, high}

Discrete vs. continuous attributes
- Discrete-valued attributes
  - The set of values is finite
  - Includes attributes whose values are integers
  - Includes binary attributes
- Continuous-valued attributes
  - The values are real numbers

Descriptive characteristics of data
- Purpose: to understand the available data (central tendency, variation, distribution)
- Data dispersion
  - Minimum/maximum values (min/max)
  - Most frequent value (mode)
  - Mean
  - Median
  - Variance and standard deviation
  - Outliers
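
As a rough illustration of the measures listed above, the sketch below computes them with Python's standard `statistics` module. The sample values are the Price values from the binning example in these slides. The function name `describe` is an assumption made for this sketch, and the population (rather than sample) variance is an arbitrary choice. Outlier detection is not covered here.

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mode": statistics.mode(values),          # most frequent value
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),  # population variance (assumed choice)
        "stdev": statistics.pstdev(values),        # population standard deviation
    }

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
summary = describe(prices)
for name, value in summary.items():
    print(f"{name}: {value}")
```

For this sample the mode is 21 and the median is 22.5, while the mean is about 20.33. The mean falls below the median, a first hint that the values are not perfectly symmetric.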

Data visualization
- Represents data with visualization methods, helping to understand the characteristics of the data
- Provides a qualitative view of large datasets
- Can reveal patterns, trends, structures, anomalies, and relationships in the data
- Helps identify important data regions and suitable parameters for subsequent quantitative analyses
- In some cases, can provide visual evidence for the (knowledge) representations obtained

Symmetric vs. skewed data
- The mean, median, and mode for:
  - Symmetric data
  - Skewed data
(Han, Kamber - Data Mining: Concepts and Techniques)

Histograms
- A histogram is a very widely used graph-based representation
- Displays occurrence statistics (counts/frequencies) for a given attribute
(Han, Kamber - Data Mining: Concepts and Techniques)

Scatter plots
- Show the two-dimensional relationship (between 2 attributes) of the data
- Allow visual inspection of groups of points, outliers, ...
- Each pair of values of the 2 attributes is treated as the 2 coordinates of a point displayed on the plane
(Han, Kamber - Data Mining: Concepts and Techniques)

Data preprocessing: main tasks
- Data cleaning
  - Fill in missing attribute values, correct noisy/erroneous data, identify or remove outliers, resolve data inconsistencies
- Data integration
  - Integrate multiple databases, data cubes, or data files
- Data transformation
  - Normalize and aggregate data
- Data reduction
  - Reduce the representation (the attributes) of the data and reduce the data size, while still obtaining the same (or approximately the same) data mining results
- Data discretization
  - A data reduction operation applied to data with numeric attributes

Data cleaning (1)
- What are the problems with data?
  - Data collected in practice may contain noise and errors, be incomplete, or be inconsistent
- Incomplete: attribute values are missing, or some attributes are missing
  - Example: salary = (value missing)
- Noise/error: contains errors or abnormal instances
  - Example: salary = -525 (the value of this attribute cannot be negative)
- Inconsistent: contains inconsistencies
  - Example: salary = abc (does not match the numeric data type of the attribute salary)

Data cleaning (2)
- Where does unclean data come from?
- Incomplete
  - The attribute value was not available at collection time
  - Problems caused by hardware, software, or the person collecting the data
- Noise/error
  - Caused by data collection
  - Caused by data entry
  - Caused by data transmission
- Inconsistent
  - Data is collected from multiple different sources
  - Constraints (conditions) on the attributes are violated

Data cleaning (3)
- Why does data need to be cleaned?
- If the data is unclean (contains errors or noise, is incomplete or inconsistent), the data mining results are affected and unreliable
- Inaccurate (unreliable) data mining results (discovered knowledge) lead to inaccurate, suboptimal decisions
  - Example: data that contains errors or missing attribute values can lead to misleading statistics

Missing attribute values
- For some attributes, the values are absent in some records
  - Example: the value of the attribute Income is absent (was not recorded) in some records
- Attribute values may be missing because of:
  - Hardware faults
  - Incompatibility with previously recorded data, so that the (new) value was deleted
  - Data that was never entered (data-entry error)
- Missing attribute values need to be filled in (by some inference mechanism) to ensure the accuracy of the data mining results

Missing attribute values: solutions
- Ignore the records that have missing attribute values
  - Commonly applied in classification problems
  - Ineffective when the percentage of missing values differs (greatly) across attributes
- Have people inspect and fill in the missing attribute values (manual filling): tedious and costly
- Fill in the values automatically by computer
  - A default (constant) value
  - The mean value of the attribute
  - The mean value of the attribute over all examples (records) that belong to the same class as the record
  - The most probable value, based on a probabilistic method (e.g., Bayes' formula)
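
The class-conditional mean strategy above can be sketched as follows. This is only a minimal illustration: the function name `impute_class_mean`, the dictionary-based record layout, and the fallback to the overall mean are assumptions of this sketch, and the income figures are illustrative.

```python
from collections import defaultdict

def impute_class_mean(records, attr, label):
    """Fill missing (None) values of `attr` with the mean over records of the same class.

    Falls back to the overall mean when a class has no observed value.
    Assumes at least one record has an observed value for `attr`.
    Returns a new list; the input records are not modified.
    """
    by_class = defaultdict(list)
    for r in records:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    observed = [v for vs in by_class.values() for v in vs]
    overall = sum(observed) / len(observed)

    filled = []
    for r in records:
        r = dict(r)  # copy so the original record stays untouched
        if r[attr] is None:
            vs = by_class.get(r[label])
            r[attr] = sum(vs) / len(vs) if vs else overall
        filled.append(r)
    return filled

records = [
    {"income": 125, "cheat": "No"},
    {"income": 100, "cheat": "No"},
    {"income": None, "cheat": "No"},
    {"income": 95, "cheat": "Yes"},
    {"income": 85, "cheat": "Yes"},
    {"income": None, "cheat": "Yes"},
]
filled = impute_class_mean(records, "income", "cheat")
print([r["income"] for r in filled])
```

Using the class label during imputation only makes sense for training data where the class is known.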

Noisy data
- Noise: random error in the value of an attribute
- Attribute values may be erroneous (noisy) because of:
  - Faults in the data collection devices
  - Data entry errors
  - Errors during data transmission
  - Inconsistency in naming conventions (attributes/variables)

Noisy data: solutions
- Binning
  - Sort the data and partition it into bins with equal frequency
  - Each bin can then be represented by the mean, the median, or the boundaries of the values in the bin
- Regression
  - Fit the data to a regression function
- Clustering
  - Detect and remove outliers (after the clusters have been determined)
- Combined computer and human inspection
  - The computer automatically detects suspicious values (likely noise/errors)
  - These suspicious values are then checked by a human

Binning
- Equal-width partitioning
  - Divide the value range into N intervals of equal size (width)
  - If min_i and max_i are the smallest and largest values of the attribute, the size (width) of each interval = (max_i - min_i)/N
  - Not suitable for skewed data or data containing outliers, because an interval may end up containing only one (or a few) outliers
- Equal-depth (equal-frequency) partitioning
  - Divide the value range into N intervals (not necessarily of equal width) such that each interval contains approximately the same number of examples
  - More effective than equal-width partitioning
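
The two partitioning schemes can be compared with a short sketch. The function names and the handling of edge cases (the maximum value is placed in the last bin; leftover items in equal-depth binning go to the first bins) are assumptions of this sketch, not something the slides specify.

```python
def equal_width_bins(values, n_bins):
    """Split values into n_bins intervals of equal width (max - min) / n_bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:  # all values identical: everything goes into the first bin
        return [sorted(values)] + [[] for _ in range(n_bins - 1)]
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, n_bins):
    """Split sorted values into n_bins bins holding (nearly) the same number of items."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
```

On these values, equal-width binning yields bins of 3, 3, and 6 items, while equal-depth binning yields three bins of 4 items each.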

Binning: example
- Sorted values of the attribute Price: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into equal-depth (equal-frequency) bins
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Represent each bin by its mean (rounded to the nearest integer)
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
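
The smoothing step of this example can be reproduced with a few lines of Python. The function names are assumptions of this sketch. Smoothing by bin boundaries is included because the slides mention it as an alternative representation; breaking ties toward the lower boundary is an arbitrary choice made here.

```python
def smooth_by_bin_means(bins):
    """Replace every value in a bin by the bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_bin_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (assumption of this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_bin_means(bins))
print(smooth_by_bin_boundaries(bins))
```

The exact bin means are 9, 22.75, and 29.25, which round to the 9, 23, and 29 shown in the example.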

Regression
[Figure: regression line y = x + 1 in the (x, y) plane, with X1 marked on the x axis and Y1 on the y axis]
(Han, Kamber - Data Mining: Concepts and Techniques)
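
Smoothing by regression can be sketched with an ordinary least-squares line fit. The data points below are made up for illustration, and `fit_line` is a hypothetical helper name. The sketch assumes a single predictor and at least two distinct x values.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Illustrative noisy observations lying roughly along y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
# Smoothing: replace each observed y by the value predicted by the fitted line.
smoothed = [slope * x + intercept for x in xs]
print(slope, intercept)
print(smoothed)
```

For this data the fitted line is approximately y = 0.97x + 1.09, close to the y = x + 1 line in the figure.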

Cluster analysis
(Han, Kamber - Data Mining: Concepts and Techniques)

Data integration
- Data integration
  - Combine data from multiple sources into a unified data store
- Schema integration
  - Integrate metadata from different sources
  - Example: A.cust-id ≡ B.customID
- Entity identification problem (to avoid data redundancy)
  - Real-world entities need to be identified across multiple data sources
  - Example: Bill Clinton ≡ B. Clinton
- Detecting and resolving data value conflicts
  - For the same real-world entity, the attribute values from different sources differ. Possible reasons:
    - Different representations
    - Different scales or units of measurement, e.g., the metric system vs. the British system

Data integration: handling redundant data
- Redundant data occurs frequently when integrating data from multiple sources (e.g., from multiple databases)
  - Object identification: the same attribute (or the same object) may have different names in different databases
  - Derivable data: an attribute in one table may be a derived attribute in another table
    - Example: Annual Revenue and Monthly Revenue
- Redundant attributes can be detected by correlation analysis: Pearson, cosine, chi-square
- General requirement for the data integration process: minimize (ideally avoid) redundancies and inconsistencies
  - This improves the speed of the data mining process and the quality of the results (knowledge) obtained
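
A derived attribute such as Annual Revenue shows up in correlation analysis as a near-perfect correlation with its source attribute. The sketch below computes the Pearson coefficient by hand. The revenue and headcount figures are invented for illustration, and treating annual revenue as exactly 12 times monthly revenue is a simplifying assumption.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

monthly_revenue = [10.0, 12.5, 8.0, 15.0, 11.0]
annual_revenue = [12 * m for m in monthly_revenue]  # derived attribute (assumed relation)
headcount = [3, 9, 4, 2, 7]                         # unrelated, made-up attribute

r_redundant = pearson(monthly_revenue, annual_revenue)
r_unrelated = pearson(monthly_revenue, headcount)
print(r_redundant, r_unrelated)
```

A coefficient close to +1 or -1 suggests that one of the two attributes can be dropped. What counts as "close" is a threshold the analyst has to choose.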

Data transformation (1)
- Data transformation
  - Maps the entire set of values of an attribute to a new set of replacement values, such that each old value corresponds to one of the new values
- Data transformation methods
  - Smoothing: remove noise/errors from the data
  - Aggregation: summarize the data, build data cubes
  - Generalization: build concept hierarchies
  - Normalization: scale the values into a specified range
    - Min-max normalization
    - Z-score normalization
    - Normalization by decimal scaling
  - Attribute construction: build new attributes from the original attributes

Data transformation (2)
- Min-max normalization, to the range [new_min_i, new_max_i]:
  $v^{new} = \frac{v^{old} - min_i}{max_i - min_i} (new\_max_i - new\_min_i) + new\_min_i$
- Z-score normalization
  - $\mu_i$, $\sigma_i$: the mean and the standard deviation of attribute i
  $v^{new} = \frac{v^{old} - \mu_i}{\sigma_i}$
- Normalization by decimal scaling:
  $v^{new} = \frac{v^{old}}{10^j}$
  - j is the smallest integer such that $\max(\{v^{new}\}) < 1$ (for attributes with negative values, the maximum is taken over $|v^{new}|$)
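
The three normalization formulas translate directly into code. This is a sketch under stated assumptions: the function names are hypothetical, the z-score uses the population standard deviation, decimal scaling restricts j to non-negative integers, and attributes with constant values (max = min, or zero standard deviation) are not handled.

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest non-negative integer giving max |v| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Taxable Income values (in thousands) from the example dataset, sorted.
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
print(min_max(incomes))
print(z_score(incomes))
scaled, j = decimal_scaling(incomes)
print(j, scaled)
```

With a largest value of 220, decimal scaling picks j = 3, so 220 becomes 0.22.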

Data reduction
- Why does data need to be reduced?
  - A large data store may contain terabytes of data
  - As a result, data mining may take a very long time to run on the entire dataset
- Data reduction
  - Obtain a reduced representation that still produces the same (or approximately the same) analysis (mining) results as the original dataset
- Data reduction strategies
  - Dimensionality reduction: remove unimportant (or less important) attributes
  - Data/numerosity reduction
    - Data cube aggregation
    - Data compression
    - Regression
    - Discretization

Dimensionality reduction
- Negative effects of a large number of dimensions (attributes)
  - As the number of dimensions grows, the data becomes more sparse
  - The density of, and distance between, points (important for clustering and outlier detection) become less meaningful
- Dimensionality reduction
  - Avoids (reduces) the negative effects of high dimensionality
  - Helps remove irrelevant attributes and reduce noise/errors
  - Reduces the time and memory cost of the data mining process
  - Allows the data to be visualized more easily and effectively
- Dimensionality reduction techniques
  - Principal component analysis
  - Feature subset selection

Principal component analysis (1)
- Principal component analysis (PCA)
  - Find a projection onto a new attribute space that preserves as much as possible of the variation in the original dataset
  - Find the eigenvectors of the covariance matrix; these eigenvectors define the new attribute space
[Figure: data points in the (x1, x2) plane with the principal direction e]
(Han, Kamber - Data Mining: Concepts and Techniques)

Principal component analysis (2)
- Each example (record) is represented by n dimensions (attributes)
- Goal: find k (≤ n) orthogonal vectors (the principal components) that best represent the original dataset
  1) Normalize the input data: the values of the attributes are brought into the same range
  2) Compute k orthogonal vectors (these are the principal components)
  3) Each input data vector is a linear combination of these k principal component vectors
  4) The principal components are sorted in decreasing order of importance
  5) The size of the data is reduced by eliminating the components (vectors) of low importance; these vectors correspond to low variance
  6) Using the most important vectors allows the original dataset to be represented approximately
- PCA is applicable only to numeric data
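
To make the steps concrete, the sketch below finds only the first principal component of a tiny 2-attribute dataset, using power iteration on the covariance matrix. Power iteration is a technique chosen for this illustration, not one named in the slides; a real analysis would normally rely on a linear algebra library. The data points are invented, and the normalization step is reduced to mean-centering because both attributes are assumed to be on similar scales.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (unit direction, variance along it) of the top principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    # Assumes the start vector is not orthogonal to the top eigenvector.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue, i.e. the variance along vec.
    eigenvalue = sum(vec[a] * sum(cov[a][b] * vec[b] for b in range(d))
                     for a in range(d))
    return vec, eigenvalue, means

points = [(1.0, 2.0), (2.0, 3.1), (3.0, 3.9), (4.0, 5.0), (5.0, 6.1)]
direction, variance, means = first_principal_component(points)

# Reduce each 2-D point to a single coordinate along the principal direction.
projected = [sum((p[j] - means[j]) * direction[j] for j in range(2)) for p in points]
print(direction, variance)
print(projected)
```

Because the points lie almost on a straight line, one component captures nearly all of the variance, so a single coordinate per point is enough to approximate the original data.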

Feature subset selection
- With d original attributes, there can be up to 2^d possible choices of an attribute subset
- Commonly used methods for feature subset selection
  - Select attributes individually (assuming the attributes are independent of each other)
    - According to one (or several) evaluation criteria
  - Step-wise feature selection
    - The best attribute is chosen first
    - Then the next best attribute, given the attribute(s) already chosen, is selected
  - Step-wise feature elimination
    - Repeatedly remove the worst attribute
  - Combine the 2 strategies: selecting and eliminating attributes
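
Step-wise (forward) selection is a greedy loop around some scoring function. In the sketch below, `forward_selection` and `toy_score` are hypothetical names, and the weights and redundancy penalty are invented purely so the example runs. In practice the score would be something like cross-validated accuracy of a model trained on the candidate subset.

```python
def forward_selection(features, score, k):
    """Greedily pick up to k features, each time adding the one that scores best
    together with the features already selected."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up usefulness weights; "annual_income" duplicates the information in "income".
WEIGHTS = {"income": 0.6, "age": 0.3, "annual_income": 0.6, "zip": 0.05}
REDUNDANT_PAIRS = [{"income", "annual_income"}]

def toy_score(subset):
    """Illustrative score: sum of weights, minus a penalty for redundant pairs."""
    total = sum(WEIGHTS[f] for f in subset)
    for pair in REDUNDANT_PAIRS:
        if pair <= set(subset):
            total -= 0.6
    return total

chosen = forward_selection(list(WEIGHTS), toy_score, k=2)
print(chosen)
```

The greedy loop evaluates on the order of d * k subsets instead of all 2^d, at the price of possibly missing the best subset. Step-wise elimination is the mirror image: start from all attributes and repeatedly drop the one whose removal hurts the score least.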

Data cube aggregation
- The lowest level of a data cube (base cuboid)
  - The data aggregated for an individual entity of interest
  - Example: a customer in a sales data warehouse
- Multiple levels of aggregation in data cubes
  - Further reduce the size of the data to be processed
- Suitable levels of aggregation
  - Use the smallest representation that is sufficient to answer the request (information query) at hand
- Queries on aggregated information should be answered using the data cubes
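
Aggregation to different levels can be sketched as a group-by-and-sum over the base records. The `aggregate` helper, the field names, and the sales figures are all assumptions made for this illustration.

```python
from collections import defaultdict

def aggregate(rows, keys, measure):
    """Sum `measure` over all rows that share the same values for `keys`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[measure]
    return dict(totals)

# Made-up base-level sales records (one row per customer and quarter).
sales = [
    {"customer": "A", "year": 2011, "quarter": "Q1", "amount": 100.0},
    {"customer": "A", "year": 2011, "quarter": "Q2", "amount": 150.0},
    {"customer": "A", "year": 2012, "quarter": "Q1", "amount": 200.0},
    {"customer": "B", "year": 2011, "quarter": "Q1", "amount": 50.0},
    {"customer": "B", "year": 2011, "quarter": "Q3", "amount": 70.0},
]

by_customer_year = aggregate(sales, ["customer", "year"], "amount")
by_year = aggregate(sales, ["year"], "amount")
print(by_customer_year)
print(by_year)
```

A query about yearly totals only needs the two-row `by_year` summary rather than the five base records, which is the size reduction the slide refers to.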

Data sampling
- Data sampling is an important method for data selection
- Sampling is necessary because collecting and processing an entire large dataset is costly and time-consuming
- Key principles of data sampling
  - Using a sample works almost as well as using the entire dataset, provided the sample is representative of the dataset
  - A sample is representative of a dataset if it has (approximately) the same properties as the dataset

Data sampling methods
- Simple random sampling
  - Each example (record) is selected with the same probability
- Sampling without replacement
  - Once an example (record) is sampled, it is removed from the original dataset (and cannot be chosen again)
- Sampling with replacement
  - When an example (record) is sampled, it is not removed from the original dataset (and may be chosen more than once)
- Stratified sampling
  - Partition the dataset into parts (partitions)
  - Draw examples at random from each partition
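
The sampling methods above map onto Python's standard `random` module roughly as follows. The `stratified_sample` helper, the proportional allocation per partition, and the toy records are assumptions of this sketch. A fixed seed is used only to make runs repeatable.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, rng):
    """Draw about `fraction` of each partition (at least one record per partition)."""
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))  # without replacement inside a stratum
    return sample

rng = random.Random(0)
# Toy records: (id, class label); 5 "Yes" and 15 "No".
records = [(i, "Yes" if i % 4 == 0 else "No") for i in range(20)]

without_replacement = rng.sample(records, 5)   # no record can appear twice
with_replacement = rng.choices(records, k=5)   # a record may appear more than once
stratified = stratified_sample(records, key=lambda r: r[1], fraction=0.2, rng=rng)

print(without_replacement)
print(with_replacement)
print(stratified)
```

With a fraction of 0.2, the stratified sample contains 1 "Yes" record and 3 "No" records, preserving the 1:3 class ratio of the full dataset. A simple random sample of the same size gives no such guarantee.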