Data Mining Course Report

Topic: A study of the class of association rule problems in data mining. A study of the theoretical basis of the Apriori family of algorithms. Implementation of a demo program.

Instructor: Trương Quang Hải
Group members:
- Phạm Nhật Trí 51004200
- Nguyễn Bình Long 51004190
- Phạm Nguyễn Đức Dương 50900483


Table of Contents

Introduction
Chapter 1. Overview of Data Mining
  1.1. Data mining
    1.1.1. Concept
    1.1.2. Steps in the mining process
    1.1.3. Applications of data mining
  1.2. Data preprocessing
    1.2.1. Data
    1.2.2. Data cleaning
    1.2.3. Data integration
    1.2.4. Data transformation
    1.2.5. Data reduction
  1.3. Prediction
    1.3.1. Introduction to prediction
    1.3.2. Overview of regression
    1.3.3. Linear regression
    1.3.4. Nonlinear regression
  1.4. Classification
    1.4.1. Introduction to classification
    1.4.2. Classification with decision trees
    1.4.3. Classification with Bayesian networks
    1.4.4. Classification with neural networks
  1.5. Clustering
    1.5.1. Introduction to clustering
    1.5.2. Hierarchical methods
    1.5.3. Partitioning methods
  1.6. Association rule mining
    1.6.1. Introduction to association rules
    1.6.2. Discovering association rules
    1.6.3. Strategies for generating frequent itemsets
    1.6.4. The FP-Growth algorithm
Chapter 2. Applications of Data Mining
  2.1. Supporting stock replenishment decisions in a supermarket
    2.1.1. Introduction to the problem
    2.1.2. The instructor's assessment after the problem was presented
  2.2. Cross-marketing
    2.2.1. Introduction to the problem
Chapter 3. The Apriori Algorithm
  3.1. The Apriori algorithm
  3.2. Evaluation of the Apriori algorithm
  3.3. Improvements to the Apriori algorithm
Chapter 4. Demo of the Apriori Algorithm
  4.1. Implementation of the Apriori algorithm
  4.2. Guide to using the demo
    4.2.1. Setting up the environment
    4.2.2. The application
Chapter 5. Concluding Assessment
  5.1. Strengths
  5.2. Weaknesses
References

Introduction

This report consolidates the topics our group studied during our weekly presentations. Rather than follow the order of those presentations, we have reorganized the material into one coherent whole. Doing so helped us review the material systematically and should make the content easier to follow.

The report first covers the preprocessing and cleaning of data before mining. It then introduces the four methods taught in the course, presents several practical applications of data mining, describes the Apriori algorithm and its improvements, and finally presents our demo of the Apriori algorithm.

Because of our limited time and ability, mistakes are unavoidable despite our best efforts, and we ask for the instructor's understanding.

Chapter 1. Overview of Data Mining

This chapter presents the concepts of data mining, the steps of the knowledge discovery process, and its applications. It then covers the first operation performed on data, data preprocessing, followed by four data mining methods and their algorithms.

1.1. Data mining

1.1.1. Concept

Data mining, also called knowledge discovery from data, is the extraction of important patterns or knowledge (non-trivial, implicit, previously unknown, and potentially useful) from a (very) large amount of data. Other names:
- Knowledge discovery in databases (KDD).
- Knowledge extraction.
- Data/pattern analysis.

1.1.2. Steps in the mining process

The process consists of nine steps:
1- Understand the application domain: the goals of the problem and the domain-specific knowledge.
2- Create (collect) a suitable dataset.
3- Clean and preprocess the data.
4- Reduce and transform the data: identify the important attributes, reduce the number of dimensions (attributes), and find invariant representations.
5- Choose the data mining function: classification, clustering, prediction, or association rule generation.
6- Choose or develop suitable data mining algorithm(s).
7- Perform the data mining.
8- Evaluate the discovered patterns and present the knowledge: visualization, transformation, removal of redundant patterns, and so on.
9- Use the discovered knowledge.

Researchers in data systems and data warehousing view the knowledge discovery process as the following phases.

Figure 1.1.2. The knowledge discovery process.

- Data preparation: includes data cleaning, data integration, data selection, and data transformation.
- Data mining: determine the mining task and choose the mining technique. The result is a source of raw knowledge.
- Evaluation: check and filter the knowledge obtained against a set of criteria.
- Deployment.

The knowledge discovery process is not purely sequential from the first step to the last. It is iterative and may return to earlier steps.

1.1.3. Applications of data mining

- Economics: applications in business, finance, sales and marketing, insurance, commerce, banking, and so on. Examples include producing information-rich reports, analyzing risk before committing to a business or production strategy, and segmenting customers in order to delineate markets and market share.
- Science: astronomy (predicting the paths of celestial bodies and planets) and biotechnology (discovering new genes and new plant and animal breeds).
- Web: search engines.

1.2. Data preprocessing

Preprocessing starts with understanding the form, the attributes, and the description of the data to be handled. Four main stages follow: cleaning, integration, transformation, and reduction.

1.2.1. Data

a) Dataset
- A dataset is a collection of objects and their attributes.
- Each attribute describes one characteristic of an object.

Figure 1.2.1. Example of a dataset.

Example: the attributes Refund, Marital Status, Taxable Income, and Cheat.

b) Types of datasets
- Records: records in a relational database, data matrices, document representations, or transaction data.
- Graphs: the World Wide Web, information networks, or social networks.
- Ordered data: spatial data (for example, maps), temporal data (for example, time-series data), and sequence data (for example, transaction sequences).

c) Types of attribute values
- Nominal (categorical/string): the values have no order. Examples: Name, Profession.
- Binary: a special case of nominal in which the value set contains exactly two values (Y/N, 0/1, T/F).
- Ordinal: values are taken from an ordered set (Integer, Real, and so on). Examples: numeric attributes such as Age and Height, or an attribute such as Income that takes its value from the fixed set {low, medium, high}.
- Discrete-valued attributes: values come from a finite set. This includes attributes whose values are integers or binary.
- Continuous-valued attributes: values are real numbers.

d) Descriptive characteristics of data
- They help in understanding the data at hand: its central tendency, its variation, and its distribution.
- Measures describing the distribution of the data (data dispersion):
  + Minimum and maximum values (min/max).
  + Most frequent value (mode).
  + Mean.
  + Median.
  + Variance and standard deviation.
  + Outliers.
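The measures listed above can be sketched with Python's standard `statistics` module. The attribute values below are made up for illustration, and the 1.5 × IQR rule used to flag outliers is one common convention, not something this report prescribes.

```python
import statistics

# Hypothetical values of one numeric attribute (for illustration only).
incomes = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

def describe(values):
    """Return the basic descriptive measures for one numeric attribute."""
    return {
        "min": min(values),
        "max": max(values),
        "mode": statistics.multimode(values),      # every most-frequent value
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.pvariance(values),  # population variance
        "std": statistics.pstdev(values),          # population standard deviation
    }

def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (an assumed convention)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return [v for v in values if v < q1 - 1.5 * spread or v > q3 + 1.5 * spread]

summary = describe(incomes)
outliers = iqr_outliers(incomes)
```

On this sample the mean (58) sits above the median (54) because the single large value 110 pulls it upward, which is the kind of skew these measures are meant to expose.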

1.2.2. Data cleaning

Collected data must be checked for the problems that make it unclean. If the data is unclean (it contains errors or noise, is incomplete, or is inconsistent), the discovered knowledge is affected and unreliable, and it leads to inaccurate decisions. Cleaning therefore fills in missing attribute values, corrects noisy or erroneous data, identifies or removes outliers, and resolves inconsistencies.

a) Problems with data

In practice, collected data may be noisy, erroneous, incomplete, or inconsistent.
- Incomplete: attribute values are missing, or some attributes are missing altogether. Example: salary = "".
- Noise/error: the data contains errors or abnormal values. Example: salary = "-525", although the value of this attribute cannot be negative.
- Inconsistent: the data contains contradictions. Example: salary = "abc", which does not match the numeric type of the salary attribute.

b) Sources of unclean data
- Incomplete: the attribute value was not available when the data was collected, or the problem was caused by hardware, software, or the person collecting the data.
- Noise/error: caused by data collection, data entry, or data transmission.
- Inconsistent: the data was collected from different sources, or it violates the constraints (conditions) on the attributes.

c) Handling missing attribute values
- Ignore records with missing attribute values. This is usually applied in classification problems, or when the percentage of missing values for the attributes is too large.
- Have people inspect the data and fill in the missing values by hand. This is costly and very tedious.
- Fill in the values automatically by computer:
  + Assign a default value.
  + Assign the mean value of the attribute.
  + Assign the most probable value, determined by a probabilistic method.

d) Handling noisy or erroneous data
- Binning: sort the data and divide it into bins with equal frequency. Each bin can then be represented by the mean, the median, or the boundaries of the values in the bin.
- Regression: fit the data to a regression function.
- Clustering: detect and remove outliers once the clusters have been identified.
- Combined computer and human inspection: the computer automatically detects suspicious values, which a person then checks.

1.2.3. Data integration

Data integration merges data from different sources into one data store that is ready for mining.

During integration, entities from multiple sources must be identified to avoid redundancy. Example: Bill Clinton ≡ B. Clinton.

Redundancy occurs frequently when multiple sources are integrated, for several reasons:
- The same attribute (or the same object) may carry different names in different sources (databases).
- Some data is derivable: an attribute in one table may be computed from attributes in another table.
- Data may be duplicated.

Redundant attributes can be detected by correlation analysis between them.

Detecting and resolving conflicts between data values: for the same real-world entity, attribute values from different sources may differ. The cause may be a different representation, a different scale, or a different unit of measure.

The general requirement for integration is to minimize (ideally avoid) redundancy and conflicts. Doing so speeds up mining and improves the quality of the knowledge obtained.

1.2.4. Data transformation

Data transformation maps the entire set of values of an attribute to a set of replacement values such that each old value corresponds to one of the new values.

Transformation methods:
- Smoothing: remove noise and errors from the data.
- Aggregation: summarize the data and build data cubes.
- Generalization: build concept hierarchies.
- Normalization: scale the values into a specified range.
  + Min-max normalization, where the new value lies in the range [new_min_i, new_max_i]:
$$v_{new} = \frac{v_{old} - min_i}{max_i - min_i}\,(new\_max_i - new\_min_i) + new\_min_i$$
  + Z-score normalization, where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of attribute i:
$$v_{new} = \frac{v_{old} - \mu_i}{\sigma_i}$$
  + Normalization by decimal scaling, where j is the smallest integer such that $\max(\{|v_{new}|\}) < 1$:
$$v_{new} = \frac{v_{old}}{10^{j}}$$
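The three normalization methods can be sketched as follows. This is a minimal illustration: the function names and sample values are assumptions, the z-score uses the population standard deviation, and `min_max` assumes the attribute is not constant (otherwise it would divide by zero).

```python
import math

def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Z-score normalization using the mean and population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v_new|) < 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

sample = [200, 300, 400, 600, 1000]   # hypothetical attribute values
scaled = min_max(sample)
standardized = z_score(sample)
shifted, j = decimal_scaling([-986, 917])
```

Min-max keeps the relative spacing of the values, z-score centers them on zero with unit variance, and decimal scaling only moves the decimal point.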

1.2.5. Data reduction

A large data store may hold terabytes of data, which makes mining very slow, so the data should be reduced. Data reduction produces a compact representation that still yields the same (or approximately the same) mining results as the original dataset.

Reduction strategies:
- Dimensionality reduction: remove unimportant (or less important) attributes.
- Data/numerosity reduction:
  + Data cube aggregation.
  + Data compression.
  + Regression.
  + Discretization.

1.3. Prediction

1.3.1. Introduction to prediction

A prediction problem uses information about buyers (income, education level, and so on) or about the items a customer has bought to predict the choices most likely to come next. Example: when a customer buys a laptop, the seller suggests several operating system versions, antivirus software, and office applications for the customer to consider.

All four methods taught in the course (regression, classification, clustering, and association rule mining) can be used for prediction. This section covers only regression. The other three are introduced in later sections.

Prediction with regression is used to predict continuous (numeric) values. The other three techniques are used to predict discrete values.

1.3.2. Overview of regression

a) Concept
* Regression is a statistical technique for predicting continuous (numeric) values. J. Han et al. (2001, 2006).
* Regression (regression analysis) is a statistical technique for estimating the relationships between variables. Wiki (2009).
* Regression (regression analysis) is a statistical technique, used in data analysis and in building models from experiments, in which the discovered regression model serves for prediction, control, or learning the mechanism that generated the data. R. D. Snee (1977).

b) Regression model

A regression model describes the relationship between a set of predictor variables (independent variables) and one or more responses (dependent variables).

c) Classification of regression
- Linear and nonlinear regression.
- Single (univariate) and multiple (multivariate) regression.
- Parametric, nonparametric, and semiparametric regression.
- Symmetric and asymmetric regression.

1.3.3. Linear regression

Linear regression comprises simple (single-variable) linear regression and multiple linear regression.

* Simple linear regression

General form: y = w0 + w1x

Here x is the predictor variable and y is the response variable, the value predicted for the corresponding x. To determine w0 and w1, we use the least squares method, which yields the best-fitting line. The two coefficients are computed as follows, where $\bar{x}$ and $\bar{y}$ are the means of the observed x and y values:

$$w_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1\bar{x}$$

Example: simple linear regression on a sample dataset.
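A minimal sketch of the least squares fit for y = w0 + w1x, using the standard closed-form estimates w1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and w0 = ȳ − w1x̄. The five (x, y) pairs are hypothetical and are not the dataset of the report's example, and `fit_line` and `predict` are illustrative names.

```python
def fit_line(xs, ys):
    """Return (w0, w1) of the least squares line y = w0 + w1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx                 # slope
    w0 = y_mean - w1 * x_mean      # intercept
    return w0, w1

def predict(x, w0, w1):
    """Predicted response for a new predictor value x."""
    return w0 + w1 * x

xs = [1, 2, 3, 4, 5]               # hypothetical predictor values
ys = [2, 4, 5, 4, 5]               # hypothetical responses
w0, w1 = fit_line(xs, ys)
```

The fitted line always passes through the point (x̄, ȳ), which here is (3, 4).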

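Figure 1.3.2. Example of simple linear regression.

1.3.4. Nonlinear regression

General form:
$$Y_i = b_0 + b_1 X_{i1} + b_2 X_{i2} + \dots + b_k X_{ik}$$
where:
- i = 1..n, with n the number of observed objects
- k = the number of independent variables (attributes/criteria/factors)
- Y = the dependent variable
- X = the independent variables
- $b_0, \dots, b_k$ = the values of the regression coefficients

This form is linear in the coefficients. A nonlinear model such as a polynomial reduces to it when each nonlinear term (for example $x^2$) is treated as a separate variable $X_{ij}$.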

1.4. Classification

1.4.1. Introduction to classification

Data classification is a form of data analysis that extracts models describing data classes or predicting data trends.

The process has two steps:
- Learning step (training phase): build a classifier by analyzing, or learning from, the training set.
- Classification step: classify new data or objects, provided that the accuracy of the classifier has been judged acceptable.

Data classification algorithms:
- Classification with decision trees.
- Classification with Bayesian networks.
- Classification with neural networks.
- Classification with k-nearest neighbors.
- Classification with case-based reasoning.
- Classification based on genetic algorithms.
- Classification with rough set theory.
- Classification with fuzzy set theory.

1.4.2. Classification with decision trees

A decision tree is a classification model made up of:
- Internal nodes: each holds the test performed on one attribute.
- Leaf nodes: each holds a class label or a description of a class.
- Branches from an internal node: each is one outcome of the test on the corresponding attribute.

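Figure 1.4.2. An example of a decision tree.

Some attribute selection measures:

- Information Gain (used in ID3)
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$$
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\, Info(D_j)$$
$$Gain(A) = Info(D) - Info_A(D)$$
where Info(D) is the amount of information needed to classify an element of D, $p_i$ is the probability that an arbitrary element of D belongs to class $C_i$ (i = 1..m), and $D_1, \dots, D_v$ are the partitions of D induced by the v values of attribute A.

- Gain Ratio (used in C4.5)
$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|}$$
$$GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}$$

- Gini Index (used in CART)
$$Gini(D) = 1 - \sum_{i=1}^{m} p_i^2$$
$$Gini_A(D) = \frac{|D_1|}{|D|}\, Gini(D_1) + \frac{|D_2|}{|D|}\, Gini(D_2)$$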
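The three measures can be sketched as below, using the standard definitions Info(D) = −Σ p_i log2 p_i and Gini(D) = 1 − Σ p_i². The 14 rows (9 "yes", 5 "no", split by a three-valued attribute `age`) follow a familiar textbook layout and are an assumption for illustration, not data from this report. The function names are illustrative too.

```python
import math
from collections import Counter

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(rows, attr):
    """Group the class labels by the value of one attribute."""
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row["label"])
    return list(parts.values())

def gain(rows, attr):
    """Information gain of splitting rows on attr."""
    n = len(rows)
    info_a = sum(len(p) / n * info(p) for p in partition(rows, attr))
    return info([r["label"] for r in rows]) - info_a

def gain_ratio(rows, attr):
    """Information gain divided by the split information of attr."""
    n = len(rows)
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in partition(rows, attr))
    return gain(rows, attr) / split_info

def make_rows(age, yes, no):
    """Build `yes` positive and `no` negative rows for one age group."""
    return [{"age": age, "label": "yes"}] * yes + [{"age": age, "label": "no"}] * no

# Assumed class counts per age group: (yes, no).
rows = make_rows("youth", 2, 3) + make_rows("middle_aged", 4, 0) + make_rows("senior", 3, 2)
labels = [r["label"] for r in rows]
```

ID3 would pick the attribute with the largest `gain`, C4.5 the largest `gain_ratio`, and CART the split with the smallest weighted Gini index.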

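Decision tree construction algorithms: several algorithms build decision trees, such as ID3, C4.5, and CART (Classification and Regression Trees).

General algorithm for building a decision tree from training data (figure).

1.4.3. Classification with Bayesian networks

Classification with a Bayesian network classifies data using the conditional probability formula discovered by Bayes:
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
where H is the hypothesis that an object belongs to a given class and X is the observed data.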
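As a small worked instance of Bayes' rule, P(H|X) = P(X|H) · P(H) / P(X), the sketch below computes a posterior with exact fractions. All three input probabilities are hypothetical numbers chosen for illustration, and `posterior` is an illustrative name.

```python
from fractions import Fraction

def posterior(prior_h, likelihood_x_given_h, evidence_x):
    """Bayes' rule: P(H|X) = P(X|H) * P(H) / P(X)."""
    return likelihood_x_given_h * prior_h / evidence_x

# Hypothetical probabilities: H = "customer buys", X = "customer is a student".
p_buy = Fraction(9, 14)                # P(H)
p_student_given_buy = Fraction(6, 9)   # P(X|H)
p_student = Fraction(7, 14)            # P(X)

p_buy_given_student = posterior(p_buy, p_student_given_buy, p_student)
```

A Bayesian classifier evaluates this posterior for every class and assigns the object to the class with the highest value.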

1.4.4. Classification with neural networks

A neural network is modeled on the network of neurons in the brain. It is built by repeatedly learning a set of weights, so that the class labels it predicts from those weights match the training data. It is usually implemented with the backpropagation algorithm.

The network consists of an input layer, one or more hidden layers, and an output layer. Data enters at the input layer, travels according to the weights to the appropriate neurons in the hidden layers, and finally reaches the output layer, which returns the result.

Figure 1.4.4. General form of a neural network.

The backpropagation algorithm (figure).

1.5. Clustering

1.5.1. Introduction to clustering

Clustering groups a set of objects with the same or similar characteristics into one group. Objects in the same cluster are more similar to each other than to objects in other clusters. Clustering supports the data preprocessing stage, describes the distribution of the data or objects, and so on.

Typical clustering methods:
- Partitioning: partitions are created and evaluated against some criterion.
- Hierarchical: the set of data or objects is decomposed into a hierarchy according to some criterion.
- Density-based: based on connectivity and density functions.
- Grid-based: based on a multiple-level granularity structure.
- Model-based: a hypothesized model is proposed for each cluster, and the model parameters are then adjusted to best fit the data or objects of the cluster.

1.5.2. Hierarchical methods

A cluster tree represents the cluster hierarchy. Its leaves represent individual objects, and its internal nodes and root represent clusters.

Building the hierarchy top-down: start from the largest cluster, which contains all objects, and split it into smaller clusters until there are n clusters or the stopping condition is met.

Figure 1.5.3. Building the hierarchy top-down.

Building the hierarchy bottom-up:
- Create n groups, each containing one object, and build a distance matrix of order n.
- Find the two groups u and v with the smallest distance.
- Merge u and v into a group uv and build the new distance matrix for uv.
- Repeat until only one group remains.

1.5.3. Partitioning methods

Given a dataset of n objects, create a partition into k clusters such that:
- Each cluster contains at least one object.
- Each object belongs to exactly one cluster.
- The partition into k clusters optimizes the chosen partitioning criterion.

The k-means algorithm:
1- Partition the objects into k clusters at random.
2- Compute the center of each cluster in the current partition.
3- Assign each object to the cluster with the nearest center.
4- If no cluster changes, stop. Otherwise return to step 2.

Figure 1.5.3_1. The k-means algorithm with n = 10, k = 2.
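The four k-means steps can be sketched as follows. This is an illustrative version, not the one behind the figure: the eight 2-D points are hypothetical, squared Euclidean distance is assumed, and the optional `initial_labels` parameter is an addition that replaces the random partition of step 1 so that a run can be reproduced.

```python
import random

def mean_point(points):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, initial_labels=None, seed=0):
    # Step 1: random partition into k clusters (round-robin over a shuffled
    # order, so no cluster starts empty), unless a starting partition is given.
    if initial_labels is None:
        order = list(range(len(points)))
        random.Random(seed).shuffle(order)
        labels = [0] * len(points)
        for rank, idx in enumerate(order):
            labels[idx] = rank % k
    else:
        labels = list(initial_labels)   # assumed to use every cluster at least once
    centers = [None] * k
    while True:
        # Step 2: compute the center of each cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                 # keep the old center if a cluster empties
                centers[c] = mean_point(members)
        # Step 3: assign each object to the nearest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        # Step 4: stop when nothing changes, otherwise repeat from step 2.
        if new_labels == labels:
            return labels, centers
        labels = new_labels

points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
labels, centers = k_means(points, 2, initial_labels=[0, 1, 0, 1, 0, 1, 0, 1])
```

Starting from a deliberately mixed partition, the first reassignment already separates the two groups, and the second pass confirms that nothing changes.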

The k-medoids algorithm:
1- Choose k objects at random as the cluster centers (medoids).
2- Assign each remaining object to the cluster with the nearest center.
3- Choose one non-center object at random and swap it with one of the centers if the swap changes the objects in the clusters (in practice, if it lowers the total distance from objects to their centers).
4- If a new center was assigned, return to step 2. Otherwise stop.

Figure 1.5.3_2. The k-medoids algorithm with n = 10, k = 2.

1.6. Association rule mining

1.6.1. Introduction to association rules

The association rule mining problem: given a set of transactions, find rules that predict the occurrence of some items in a transaction based on the occurrence of other items.

Examples of association rules:
{Diaper} -> {Beer}
{Milk, Bread} -> {Eggs, Coke}
{Beer, Bread} -> {Milk}

Basic definitions:
- Itemset: a set of one or more items. A k-itemset contains k items. Example: {Milk, Bread, Diaper} is a 3-itemset.
- Association rule: written X -> Y, where X and Y are itemsets.
- Support count, denoted σ: the number of transactions in which an itemset occurs. Example: σ({Milk, Bread, Diaper}) = 2.
- Support, denoted s: the fraction of all transactions that contain both X and Y. Example: s({Milk, Diaper, Beer}) = 2/5.
- Confidence, denoted c: the fraction of the transactions containing X that also contain Y. Example: c({Milk, Diaper} -> {Beer}) = 2/3.
- Frequent (large) itemset: an itemset whose support is greater than or equal to a threshold minsup.

1.6.2. Discovering association rules

Given a set of transactions T, the goal of association rule mining is to find all rules with:
- support s ≥ the threshold minsup, and
- confidence c ≥ the threshold minconf.

The brute-force approach:
- List all possible association rules.
- Compute the support and confidence of each rule.
- Discard the rules whose support is below minsup or whose confidence is below minconf.

=> The brute-force approach is computationally too expensive to be used in practice.

Consider the itemset {Milk, Diaper, Beer}.

Association rules:
{Milk, Diaper} -> {Beer} (s=0.4, c=0.67)
{Milk, Beer} -> {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} -> {Milk} (s=0.4, c=0.67)
{Beer} -> {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} -> {Milk, Beer} (s=0.4, c=0.5)
{Milk} -> {Diaper, Beer} (s=0.4, c=0.5)
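The support and confidence values of the six rules can be recomputed with a short sketch. The five transactions below are an assumption: they are a commonly used market-basket example chosen because they reproduce every figure quoted in this section (σ({Milk, Bread, Diaper}) = 2, s = 2/5, and the six confidences). The function names are illustrative.

```python
from itertools import combinations

# Assumed transaction table, consistent with the numbers quoted in the text.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset):
    """Number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset):
    """Fraction of all transactions that contain itemset."""
    return support_count(itemset) / len(transactions)

def confidence(lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    return support_count(set(lhs) | set(rhs)) / support_count(lhs)

def rules_from(itemset):
    """Yield (lhs, rhs, s, c) for every split of itemset into two non-empty parts."""
    items = frozenset(itemset)
    for size in range(1, len(items)):
        for lhs in combinations(sorted(items), size):
            lhs = frozenset(lhs)
            rhs = items - lhs
            yield lhs, rhs, support(items), confidence(lhs, rhs)

rules = list(rules_from({"Milk", "Diaper", "Beer"}))
```

Every rule shares the support of the underlying itemset, while the confidence depends only on the support count of the left-hand side.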

All of the rules above are splits (into two subsets) of the same itemset {Milk, Diaper, Beer}. Rules generated from the same itemset have the same support but may differ in confidence. The support requirement and the confidence requirement can therefore be handled separately.

Association rule discovery thus consists of two important steps:
- Generate the frequent (large) itemsets: generate all itemsets with support ≥ minsup.
- Generate the association rules: from each frequent itemset obtained in the previous step, generate all rules with high confidence (≥ minconf).

Even so, generating the frequent itemsets (step 1) is still computationally too expensive.

With d items, there are up to 2^d possible itemsets to consider.

The lattice of itemsets to consider, for d = 5.

With the brute-force approach to generating frequent itemsets (step 1):

Figure 1.6.2. Generating frequent itemsets by brute force.

- Every itemset in the lattice is considered.
- The support of each itemset is computed by scanning all transactions.
- Each transaction is compared with every itemset under consideration.
- Complexity ~ O(N.M.w), where N is the number of transactions, M is the number of itemsets under consideration, and w is the maximum transaction width. With M = 2^d this complexity is far too high.

1.6.3. Strategies for generating frequent itemsets

The analysis in Section 1.6.2 leads to the following strategies:
- Reduce the number of itemsets to consider (M): a complete search covers M = 2^d itemsets, so pruning techniques are used to lower M.
- Reduce the number of transactions to consider (N): lower N as the size (number of items) of the itemsets grows.
- Reduce the number of comparisons (matchings) between itemsets and transactions (N.M): use suitable, efficient data structures to store the candidate itemsets or the transactions, so that not every itemset has to be compared with every transaction.

Based on these strategies we consider two basic algorithms:
- The Apriori algorithm (presented in Section 3.1).
- The FP-Growth algorithm.

1.6.4. The FP-Growth algorithm

FP-Growth represents the transaction data with a data structure called the FP-tree and uses it to determine the frequent itemsets directly.

Representation with an FP-tree:
- For each transaction, the FP-tree builds one path in the tree.
- If two transactions share some items, their paths share a common part (segment). The more paths share common parts, the more compact the FP-tree representation becomes.
- If the FP-tree is small enough to fit in working memory, FP-Growth can determine the frequent itemsets directly from the in-memory FP-tree.

Building the FP-tree:
- Initially the FP-tree contains only the root node (denoted null).
- The transaction database is scanned a first time to compute the support of each item.
- Infrequent items are discarded.
- Frequent items are sorted in decreasing order of support.
- The transaction database is scanned a second time to build the FP-tree.

Example: building an FP-tree.
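The two-scan construction can be sketched as below. This is an illustration under assumptions, not the tree of the report's example: the five market-basket transactions and the minimum support count of 3 are assumed values, ties in support are broken alphabetically, and `FPNode`, `build_fp_tree`, and `count_nodes` are illustrative names.

```python
from collections import Counter

class FPNode:
    """One FP-tree node: an item, its count, and links to parent and children."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, minsup_count):
    # Scan 1: compute the support count of every item, drop infrequent items.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= minsup_count}
    root = FPNode(None)                          # the "null" root
    # Scan 2: insert each transaction as a path of its frequent items,
    # sorted by decreasing support (ties broken alphabetically).
    for t in transactions:
        path = sorted((i for i in set(t) if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in path:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1                      # shared prefixes accumulate counts
    return root, frequent

def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())

# Assumed transactions (for illustration only).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
root, frequent = build_fp_tree(transactions, minsup_count=3)
```

Four of the five transactions start with Bread, so they share one prefix node whose count is 4. That sharing is what keeps the tree smaller than the raw transaction list.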

Generating the frequent itemsets:
- FP-Growth generates the frequent itemsets directly from the FP-tree, working from the leaf level up to the root (bottom-up).
- Because each transaction is represented by a path in the FP-tree, the frequent itemsets that end with a given item (for example E) can be determined by traversing the paths that contain that item (E).

Chapter 2. Applications of Data Mining

2.1. Supporting stock replenishment decisions in a supermarket

2.1.1. Introduction to the problem

This problem uses a supermarket's data on electric fans after one year of business. The data is plotted with the month on the horizontal axis and that month's sales volume on the vertical axis. From this sales chart we determine a curve shape that (approximately) captures its variation. For the following year we reuse this curve to predict customer demand for fans, on the assumption that demand will follow the same shape. This helps the manager plan ahead and stock a suitable number of fans to meet customer demand.

2.1.2. The instructor's assessment after the problem was presented

After we presented this application with slides (data mining to support proactive decisions by the warehouse manager in meeting customer demand), the instructor gave the following comments.

Using regression for its predictive ability to plan fan replenishment is not appropriate. What the group actually presented is a way of extracting a pattern from a model so that, when the pattern appears again, we can predict how it will unfold (usually as it did before). In addition, the fields describing the weather on each day should not be removed during reduction, because they improve the accuracy of the algorithm.

2.2. Cross-marketing

2.2.1. Introduction to the problem

In a retail store the manager has many options for arranging the positions of the products on sale. In practice, a customer who buys product A tends to go on to buy products B, C, D, and so on that are related to A. The manager therefore needs to study the baskets that customers typically put together in their transactions and derive rules from them. Items that are often bought together can then be placed next to each other, which saves customers walking effort and suggests further purchases that increase sales. Association rule mining can be applied to derive these rules.

Example: a retail store extracts a number of transactions that customers made from its transaction history.

After applying an association rule mining algorithm (for example, Apriori), we obtain the following table of frequent itemsets:

Itemset         Support count
{I1, I2, I3}    2
{I1, I2, I5}    2

Based on this table, the store manager asks the staff to place the products of each itemset close together in the same row (the same aisle), which helps shoppers and stimulates purchases.

Chapter 3. The Apriori Algorithm

This chapter presents the Apriori algorithm for association rules, evaluates it, and describes ways to improve it.

3.1. The Apriori algorithm

Generating association rules takes two steps. The first generates the frequent itemsets. The second generates the association rules. Section 1.6.2 showed that the first step is very expensive. The Apriori algorithm is one way to reduce the complexity of that step.

Principle of the Apriori algorithm

Support-based pruning:
- If an itemset is frequent, then all of its subsets are frequent itemsets.
- If an itemset is not frequent, then all of its supersets are not frequent itemsets.

The principle rests on the anti-monotone property of support:
$$\forall X, Y: (X \subseteq Y) \Rightarrow s(X) \ge s(Y)$$

The lattice of itemsets to consider, pruned by support.

Example: support-based pruning with minsup = 3.

The Apriori algorithm:
1- Generate all frequent 1-itemsets.
2- Set k = 1.
3- Repeat until no new frequent itemsets are found:
3.1- From the frequent k-itemsets, generate the candidate (k+1)-itemsets.
3.2- Prune the candidate (k+1)-itemsets that contain a k-subset that is not frequent.
3.3- Compute the support of each candidate (k+1)-itemset by scanning all transactions.
3.4- Discard the (k+1)-itemsets that are not frequent.
3.5- Keep the resulting frequent (k+1)-itemsets and increase k by 1.

Example: with minsup = 2 [1].

4- For each frequent itemset (I) obtained, generate all of its non-empty proper subsets (B).
5- For each subset B, generate the association rule B -> (I - B).
6- For each association rule, scan all transactions and keep the rules with confidence c ≥ minconf.

Example: with I = {A1, A2, A5}.
The subsets of I are: {A1}, {A2}, {A5}, {A1, A2}, {A1, A5}, {A2, A5}.
They give the following association rules:
{A1} => {A2, A5}; {A2} => {A1, A5}; {A5} => {A1, A2}; {A1, A2} => {A5}; {A1, A5} => {A2}; {A2, A5} => {A1}

With the frequent itemset I = {B, C, E} and min_conf = 80%, we obtain two association rules: {B, C} => {E} and {C, E} => {B}.
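Steps 1 to 6 can be sketched as follows. The nine transactions are an assumption: a widely used textbook set, chosen because with minsup = 2 it yields exactly the frequent 3-itemsets {I1, I2, I3} and {I1, I2, I5} with support count 2 reported in Section 2.2. The sketch takes confidence from the stored support counts rather than rescanning the transactions in step 6, and it is not the demo's implementation.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                       # step 1: frequent 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= minsup_count}
    frequent = dict(current)
    k = 1                                        # step 2
    while current:                               # step 3
        # 3.1: join frequent k-itemsets into candidate (k+1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # 3.2: prune candidates that have an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        # 3.3: count support by scanning all transactions
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # 3.4 and 3.5: keep only the frequent (k+1)-itemsets
        current = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(current)
        k += 1
    return frequent

def generate_rules(frequent, minconf):
    """Steps 4 to 6: return (lhs, rhs, confidence) for rules with c >= minconf."""
    rules = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):      # non-empty proper subsets B
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = count / frequent[lhs]     # every subset of I is frequent
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, conf))
    return rules

# Assumed transactions (textbook example, not taken from this report).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
frequent = apriori(transactions, minsup_count=2)
rules = generate_rules(frequent, minconf=0.7)
```

The candidate {I1, I2, I3, I5} is never counted: its subset {I1, I3, I5} is not frequent, so step 3.2 removes it before any scan of the transactions.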

3.2. Evaluation of the Apriori algorithm

Factors affecting performance:
- Choice of the minsup threshold: a minsup that is too low produces many frequent itemsets, which increases the number of itemsets that must be considered.
- Number of items in the (transaction) database: more memory is needed to store the support of each item. As the number of frequent items (frequent 1-itemsets) grows, the computation cost and the I/O cost (scanning the transactions) grow too.
- Size of the (transaction) database: the algorithm must scan the database many times, so the computation cost of Apriori rises as the number of transactions rises.
- Average transaction size: as the average size (number of items) of the transactions grows, the maximum length of the frequent itemsets grows too.

Comparison of the Apriori and FP-Growth algorithms:

Chart: support versus running time.

Chart: number of transactions versus running time.

3.3. Improvements to the Apriori algorithm

- Hash-based technique: a k-itemset whose hashing bucket count is below the minimum support threshold cannot be a frequent itemset.
- Transaction reduction: a transaction that contains no frequent k-itemset does not need to be checked in later passes (for the (k+1)-itemsets).
- Partitioning: an itemset must be frequent in at least one partition to be frequent in the entire dataset.
- Sampling: mine only a given subset of the data with a lower support threshold. This requires a method for verifying completeness.
- Dynamic itemset counting: add candidate itemsets only when all of their subsets are estimated to be frequent.

Chapter 4. Demo of the Apriori Algorithm

4.1. Implementation of the Apriori algorithm

The demo implements an improved Apriori algorithm based on a hash table data structure and on transaction reduction. All data is read directly from the database.
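The two improvements that the demo relies on can be sketched as below. This is an illustration, not the demo's Java code: integer item identifiers, the hash function h(x, y) = (10x + y) mod 7, the nine transactions, and all function names are assumptions.

```python
from itertools import combinations

def bucket_of(pair, n_buckets=7):
    """Hypothetical hash function: h(x, y) = (10 * x + y) mod n_buckets, x < y."""
    x, y = sorted(pair)
    return (10 * x + y) % n_buckets

def hash_filter_pairs(transactions, minsup_count, n_buckets=7):
    """Count single items and hash every 2-subset into a bucket in one scan."""
    item_counts = {}
    buckets = [0] * n_buckets
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[bucket_of(pair, n_buckets)] += 1
    frequent_items = sorted(i for i, c in item_counts.items() if c >= minsup_count)
    # A pair whose bucket count is below minsup cannot be frequent, so it is
    # dropped without ever being counted individually.
    candidates = [p for p in combinations(frequent_items, 2)
                  if buckets[bucket_of(p, n_buckets)] >= minsup_count]
    return candidates, buckets

def reduce_transactions(transactions, frequent_itemsets):
    """Keep only the transactions that contain at least one frequent k-itemset."""
    return [t for t in transactions
            if any(set(s) <= t for s in frequent_itemsets)]

# Assumed transactions over items 1..5 (for illustration only).
transactions = [{1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
                {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3}]
candidates, buckets = hash_filter_pairs(transactions, minsup_count=3)
# Here every surviving candidate pair is in fact frequent (each occurs 4 times).
remaining = reduce_transactions(transactions, candidates)
```

The bucket test is necessary but not sufficient: different pairs can collide in one bucket, so surviving candidates still have to be counted exactly. The transaction {2, 4} contains no frequent pair and is skipped in later passes.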

  • 35

    a) Languages and tools used
    - Database: MySQL.
    - Language: Java.
    - Database connection: the mysql-connector-java-5.1.14-bin.jar library.
    - User interface: the Swing library.

    b) Code explanation
    The program is written with 4 main classes, described below:
    1-Class node: holds the fields key, count, and next, and is used to store the information of an item or of a set of items.
    2-Class pushData: this class has the function getInput, whose input parameters are the database username and password and the file to import into the database.
    3-Class getHash: this is the main class; it contains the functions that carry out the improved Apriori algorithm.

    - Function getNameData:
    + Input:
    Namedb: name of the database used to retrieve the list of databases. User and password of the database.
    + Output:
    Returns the list of databases currently on the server.

    - Function getInput:
    + Input:
    Data: name of the database to read data from. Username and password of the database.
    + Output:
    Returns an array of nodes holding the items and the count of each item after scanning the whole database.
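The first pass described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the demo's code: the transactions are passed in memory instead of being read from MySQL, the class names `Node` and `ItemCountSketch` and the method `countItems` are hypothetical, and the `next` field is left unused because the text does not say how it is linked.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Mirrors the fields named in the text: key, count, next.
class Node {
    String key;   // one item, or several items joined by commas
    int count;    // number of transactions containing the key
    Node next;    // assumption: link to another node; unused in this sketch

    Node(String key, int count) {
        this.key = key;
        this.count = count;
    }
}

class ItemCountSketch {

    // Scans all transactions once and returns one node per distinct item,
    // in order of first appearance.
    static Node[] countItems(List<List<String>> transactions) {
        Map<String, Node> byKey = new LinkedHashMap<>();
        for (List<String> t : transactions) {
            // A LinkedHashSet counts an item once per transaction
            // even if it is listed twice.
            for (String item : new LinkedHashSet<>(t)) {
                byKey.computeIfAbsent(item, k -> new Node(k, 0)).count++;
            }
        }
        return byKey.values().toArray(new Node[0]);
    }

    public static void main(String[] args) {
        List<List<String>> tx = Arrays.asList(
                Arrays.asList("1", "2", "3"),
                Arrays.asList("1", "2"),
                Arrays.asList("2", "2", "4"));
        for (Node n : countItems(tx)) {
            System.out.println(n.key + " = " + n.count);
        }
    }
}
```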


    - Function calculateInit:
    + Input:
    C: an array of nodes holding the keys and the count of each key. Minsub: the support entered by the user.
    + Output:
    Returns a list of the nodes whose support is greater than minSub.

    - Function genrateTableHash:

    + Input:
    L: a list of nodes in which each node's key is a single item. Count: the level of the set L(k+1) that will be generated next from the set Ck. Because each key is 1 item, count = 2.
    + Output:
    Returns a hash table whose keys are frequent sets of items separated by commas, e.g. (123,987). Each node holds the key and the count of that key, initialized to 0.

    - Function GenrateTableHash_2

    + Input:
    L: a list of nodes in which each node's key consists of several items. Count: the level of the set L(k+1) that will be generated next from the set Ck. Because each key consists of several items, count > 2.
    + Output:
    Returns a hash table whose keys are frequent sets of items separated by commas, e.g. (123,987). Each node holds the key and the count of that key, initialized to 0.

    - Function CompareData:

    + Input:
    hasTable: the hash table whose keys are frequent sets and whose values are nodes. Count: the level of the set L.


    + Output:
    Returns a list of nodes taken from the hash table after scanning the database.

    - Function GenrateListPartitions:

    + Input:
    L: a list of nodes, where each node is a frequent itemset.
    + Output:
    Returns a list of lists containing the generated rules, which have not yet been checked against the minimum confidence.

    - Function calculateRule:

    + Input:
    L: a list of lists containing the rules that have not yet been checked against the minimum confidence. MinCon: the minimum confidence of the rules.
    + Output:
    Returns a list of lists, where each inner list is a rule that satisfies the minimum confidence.

    4-Class AppAprioriImprove: this class contains the graphical user interface and handles interaction with the user. It includes the following events:

    - jButton4ActionPerformed:
    + Input:

    Username and password of the user's database.
    + Output:
    If the connection succeeds, the list of databases on the server is shown in the name Data combobox. If the connection fails, a message is shown and the user is asked to enter the credentials again.

    - jButton2ActionPerformed:
    + Input:


    Username and password of the database. getFile: name of the file to import into the database. NameData: name of the database to import into.
    + Output:
    When the data has been imported into the database successfully, the message "import to database successful" appears.

    - jButton5ActionPerformed: same as the jButton4ActionPerformed event.

    - jButton6ActionPerformed:
    + Input:
    Username and password of the user. nameData: the database to mine. Min Support and Min Confidence entered by the user.
    + Output:
    Outputs the frequent itemsets that satisfy MinSup and the rules that satisfy MinCon.
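The final step, turning frequent itemsets into rules that satisfy MinCon, can be sketched as follows. This is an illustrative sketch and not the demo's code: it assumes the support counts of all frequent itemsets are available in a map, it enumerates subsets with a bitmask instead of an external combinatorics library, and the names `RuleSketch` and `rules` are hypothetical. The confidence of a rule X => Y is computed as support(X ∪ Y) / support(X).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class RuleSketch {

    // Returns every rule "X => Y" whose confidence is at least minCon.
    // The map must hold the support count of each frequent itemset.
    static List<String> rules(Map<Set<String>, Integer> support, double minCon) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Set<String>, Integer> e : support.entrySet()) {
            List<String> items = new ArrayList<>(new TreeSet<>(e.getKey()));
            int n = items.size();
            if (n < 2) {
                continue; // a rule needs a non-empty left and right side
            }
            // Masks 1 .. 2^n - 2 select each non-empty proper subset as X.
            for (int mask = 1; mask < (1 << n) - 1; mask++) {
                Set<String> lhs = new TreeSet<>();
                Set<String> rhs = new TreeSet<>();
                for (int i = 0; i < n; i++) {
                    if ((mask & (1 << i)) != 0) {
                        lhs.add(items.get(i));
                    } else {
                        rhs.add(items.get(i));
                    }
                }
                Integer lhsCount = support.get(lhs);
                if (lhsCount == null) {
                    continue; // support of X unknown, skip the rule
                }
                double confidence = (double) e.getValue() / lhsCount;
                if (confidence >= minCon) {
                    out.add(lhs + " => " + rhs);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Set<String>, Integer> support = new HashMap<>();
        support.put(new HashSet<>(Arrays.asList("A")), 3);
        support.put(new HashSet<>(Arrays.asList("B")), 2);
        support.put(new HashSet<>(Arrays.asList("A", "B")), 2);
        System.out.println(rules(support, 0.8));
    }
}
```

In this example A => B has confidence 2/3 and is rejected at MinCon = 0.8, while B => A has confidence 2/2 and is kept.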

    4.2. Demo user guide

    4.2.1. Environment setup

    Install the Java JDK, which can be downloaded from: http://www.oracle.com/technetwork/java/javase/downloads/index.html

    Install MySQL. This guide uses the Wamp Server package, which can be downloaded from: http://www.wampserver.com/en/

    Use the MySQL database connector library: mysql-connector-java-5.1.14-bin.jar.

    Use the library that generates combinations of a set from a list: combinatoricslib-2.0.jar


    4.2.2. Application

    a) Import data from a data file

    Running the file AppAprioriImprove.jar displays the interface:

    Figure 4.2.2_1

    Select the Import Data tab and enter the user name and password of the database. If the user name or password is wrong, a "connect to database" error is reported and the user must enter them again.

    Figure 4.2.2_2


    When the correct Username and Password are entered, the list of databases appears.

    Figure 4.2.2_3

    Choose File and then press Import. The file is imported into the database as the Transaction table, and the log file is displayed.

    Figure 4.2.2_4


    The log is exported to the file E:\logPushToData.txt. This path is fixed: if the machine has no E:\ drive, an error occurs, and the app must be restarted.

    b) Run the improved Apriori algorithm on the database just imported

    Select the Apriori Improve tab. Enter the user name and password in the same way as for importing data.

    Choose the Data on which to run the algorithm. Note that the Data must contain the Transaction table.

    Figure 4.2.2_5

    Enter Min Support as an integer greater than 0, and Min Confidence as a real number with 0 < MinCon < 1. If the input is invalid, the program reports an error and asks you to enter it again.
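The input rules above can be expressed as a small validation sketch. It is only an illustration of the stated constraints, not the demo's code: the class name `InputCheckSketch`, the method names, and the error messages are hypothetical.

```java
class InputCheckSketch {

    // Min Support must be an integer greater than 0.
    static int parseMinSupport(String text) {
        int value;
        try {
            value = Integer.parseInt(text.trim());
        } catch (NumberFormatException ex) {
            throw new IllegalArgumentException("Min Support must be an integer", ex);
        }
        if (value <= 0) {
            throw new IllegalArgumentException("Min Support must be greater than 0");
        }
        return value;
    }

    // Min Confidence must be a real number strictly between 0 and 1.
    static double parseMinConfidence(String text) {
        double value;
        try {
            value = Double.parseDouble(text.trim());
        } catch (NumberFormatException ex) {
            throw new IllegalArgumentException("Min Confidence must be a real number", ex);
        }
        // Written as a negated range check so that NaN is rejected too.
        if (!(value > 0 && value < 1)) {
            throw new IllegalArgumentException("Min Confidence must satisfy 0 < MinCon < 1");
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(parseMinSupport("3"));
        System.out.println(parseMinConfidence("0.6"));
    }
}
```

A Swing event handler could catch the `IllegalArgumentException` and show its message in a dialog before asking for the value again.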


    Figure 4.2.2_6

    Figure 4.2.2_7


    After entering valid input, choose Start. The program runs and outputs the result.

    Figure 4.2.2_8

    To export the result, choose Export Result. By default the program saves the result file to E:\Result_GenRule.txt.


    Chapter 5. Evaluation and conclusion

    5.1. Strengths
    - The Apriori algorithm was understood and implemented successfully.
    - The application implements improvements to the algorithm, using a hash-table data structure and transaction reduction, which reduce the running time of the algorithm.
    - Running speed is also improved by storing all data in the database. Every access is made directly against the database.
    - The application has an intuitive, easy-to-use interface.

    5.2. Weaknesses
    - No way has yet been found to verify the correctness of the association rules that the application produces.
    - Removing association rules that have no practical use has not yet been implemented.
    - The application does not yet show the user a visual progress bar while data is being processed.


    References
    - Lecture slides for the Data Mining course, Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (Bach Khoa).
    - Lecture slides for the Data Mining course, Nguyễn Nhật Quang, Hanoi University of Science and Technology (Bach Khoa Ha Noi).
    - Ebook: Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Third Edition, Morgan Kaufmann Publishers.
    - [1]: http://bis.net.vn/forums/t/389.aspx
    - http://en.wikipedia.org/wiki/Apriori_algorithm