8/8/2019 Bao Cao Tot Nghiep KPDL
1/67
SUPERVISOR'S COMMENTS
PREFACE
Science and technology are advancing rapidly in every field. Computer science in particular has developed strongly and is applied in many areas of life, such as education, health care, economics, science and construction. It has become an indispensable part of people's daily lives. Computing tools have been used to organise and exploit databases since the 1960s. In recent years in particular, the role of computers in storing and processing information has become increasingly important. At the same time, automatic data collection devices have matured and created enormous data stores. With advances in electronics producing large-capacity memory, high-speed processors and telecommunication networks, organisations have built information systems to automate all of their business activities. This has produced a constantly growing stream of data, because even the simplest transactions, such as a phone call, a medical check-up or a credit card payment, are recorded on computers. The volume is now enormous, covering databases, customer information, transaction histories, sales data, loan account data, capital usage data and so on. The question is how to process such a huge amount of information in order to discover the knowledge hidden in it.
To do this, people use the process of Knowledge Discovery in Databases (KDD). The task of KDD is to find, in existing data, valuable hidden information that was previously undiscovered, and to identify development trends and the factors that influence them. The techniques that let us extract knowledge from an existing database are called Data Mining techniques.
For these reasons we studied the topic "Data mining with association rules". The aim is to analyse data and apply techniques that find the patterns and regularities in a data set that the user is looking for, and to apply them to the "Supermarket sales management" problem.
While working on this project we received dedicated help and guidance from the lecturers of the Faculty of Information Technology and from our classmates, especially our supervisor, Mr. Trn Hng Cng. Because our time was limited and our ability is still modest, mistakes are unavoidable, and we look forward to further comments from our lecturers and friends.
We also sincerely thank the lecturers of the Faculty of Information Technology for supporting us throughout the project and our studies at the university.
We sincerely thank our classmates for helping us complete this thesis.
Thank you very much!
Students:
Phm Th Hon
Trn Vit Phng ng
Class C-H-KHMT3-K1
PROJECT ABSTRACT
This project covers data mining with association rules, the classical algorithms used in association rule mining, and the application of the Apriori algorithm to a small part of the "Supermarket sales management" problem.
The aims of the project are:
- To analyse data and apply techniques that find the patterns and regularities in a data set that the user is looking for.
- To present basic algorithms such as Apriori and the algorithm that finds association rules without candidate generation, based on the FP-Tree structure, as used to analyse a database with association rules.
- To analyse a database and implement the Apriori algorithm, applied to a small part of the "Supermarket sales management" problem.
The project consists of 3 chapters, as follows:
Chapter I: Overview of data mining. This chapter covers: data mining and knowledge discovery, the process of knowledge discovery in databases, the benefits of data mining, data mining techniques, the main tasks of data mining, data mining methods, applications of data mining, and some challenges facing data mining.
Chapter II: Frequent itemsets and association rules. This chapter covers: basic concepts and properties of frequent itemsets and association rules, finding frequent itemsets, some basic association rule algorithms, and worked examples of the algorithms.
Chapter III: Implementing and testing the algorithm for finding frequent itemsets and association rules. This chapter analyses a database and describes the implementation of a program that mines association rules for supermarket sales management. From its results, the sales manager can see which groups of goods are related to one another, which supports managing and choosing the goods to sell.
SUMMARY OF THE PROJECT
This project covers data mining with association rules, the classical algorithms used in association rule mining, and how to apply the Apriori algorithm to a small part of the Sales Management problem in a supermarket.
The purposes of this project are:
- Analysing data and using mining techniques to find the patterns and regularities in a data set that users want.
- Presenting classical algorithms such as Apriori and the algorithm that finds association rules without generating candidates, based on the FP-Tree structure, as used to analyse a database with association rules.
- Analysing a database and implementing the Apriori algorithm, applied to part of the Sales Management task in a supermarket.
The project has 3 chapters, with main content as follows:
Chapter I: Overview of data mining. This chapter presents: data mining and knowledge discovery in databases, the advantages of data mining, data mining techniques, the main tasks of data mining, data mining methods, applications of data mining, and some challenges facing data mining.
Chapter II: Frequent itemsets and association rules. This chapter includes: some concepts and basic properties of frequent itemsets and association rules, searching for frequent itemsets, some basic association rule algorithms, and examples that illustrate the algorithms.
Chapter III: How to implement and test the algorithm for finding frequent itemsets and association rules. This chapter analyses a database and presents the implementation of a program that mines association rules for sales management in a supermarket. The sales manager can use the results to identify groups of related products, which supports management and the choice of products to sell.
TABLE OF CONTENTS
SUPERVISOR'S COMMENTS
PREFACE
PROJECT ABSTRACT
SUMMARY OF THE PROJECT
LIST OF TABLES
LIST OF ABBREVIATIONS
INTRODUCTION
Chapter I: OVERVIEW OF DATA MINING
1.1. Problem statement
1.2. Data mining and knowledge discovery
1.3. The process of knowledge discovery in databases
  1.3.1. Problem definition
  1.3.2. Data collection and preprocessing
    1.3.2.1. Data gathering
    1.3.2.2. Data selection
    1.3.2.3. Data cleaning
    1.3.2.4. Data enrichment
    1.3.2.5. Data encoding
    1.3.2.6. Evaluation and presentation
  1.3.3. Data mining
  1.3.4. Stating and evaluating the results
  1.3.5. Using the discovered knowledge
1.4. Benefits of data mining
1.5. Data mining techniques
  1.5.1. Descriptive data mining techniques
  1.5.2. Predictive data mining techniques
1.6. Main tasks of data mining
  1.6.1. Classification
  1.6.2. Regression
  1.6.3. Clustering
  1.6.4. Summarization
  1.6.5. Dependency modeling
  1.6.6. Change and deviation detection
1.7. Data mining methods
  1.7.1. Components of a data mining algorithm
  1.7.2. Some common data mining methods
    1.7.2.1. Induction
    1.7.2.2. Decision trees and rules
    1.7.2.3. Association rule discovery
    1.7.2.4. Neural networks
    1.7.2.5. Genetic algorithms
1.8. Applications of data mining
1.9. Some challenges facing data mining
Chapter II: FREQUENT ITEMSETS AND ASSOCIATION RULES
2.1. Introduction
2.2. Basic concepts
  2.2.1. Definition 2.2.1: Data mining context
  2.2.2. Definition 2.2.2: Galois connections
  2.2.3. Definition 2.2.3: Support
  2.2.4. Definition 2.2.4: Confidence
    2.2.4.1. Property 2.2.4.1: Support of a subset
    2.2.4.2. Property 2.2.4.2
    2.2.4.3. Property 2.2.4.3
    2.2.4.4. Property 2.2.4.4
  2.2.5. Definition 2.2.5: Frequent itemset
  2.2.6. Definition 2.2.6: Association rule
    2.2.6.1. Property 2.2.6.1: Association rules are not composable
    2.2.6.2. Property 2.2.6.2: Association rules are not decomposable
    2.2.6.3. Property 2.2.6.3: Association rules are not transitive
    2.2.6.4. Property 2.2.6.4
2.3. Finding frequent itemsets
  2.3.1. Some concepts
  2.3.2. The Apriori algorithm
    2.3.2.1. Description of the algorithm
    2.3.2.2. Worked example of the Apriori algorithm
    2.3.2.3. Procedure-Code
    2.3.2.4. Generating candidate (k+1)-itemsets
2.4. Finding association rules
  2.4.1. Statement of the association rule mining problem
  2.4.2. Developing an efficient solution for association rule mining
2.5. The association rule mining process
2.6. Some other algorithms
  2.6.1. Parallel mining algorithm for fuzzy association rules
  2.6.2. The FP-Growth algorithm
    2.6.2.1. Core idea
    2.6.2.2. Procedure
    2.6.2.3. The FP_Growth algorithm
Chapter III: IMPLEMENTING AND TESTING THE ALGORITHM FOR FINDING FREQUENT ITEMSETS AND ASSOCIATION RULES
3.1. Problem statement
3.2. Choosing the algorithm for the software implementation
3.3. Requirements for implementing the algorithm
3.4. Database
  3.4.1. Main interface of the database
  3.4.2. Supplier catalog table
  3.4.3. Goods catalog table
  3.4.4. Customer catalog table
  3.4.5. Invoice catalog table
  3.4.6. Invoice detail catalog table
  3.4.7. Writing XML
3.5. Main program interface
3.6. Data connection
3.7. Adding XML data
3.8. Analysis results
3.9. Filter results with MinSup = 10
3.10. Filter results with MinCon = 40%
GENERAL CONCLUSION
DIRECTIONS FOR FURTHER DEVELOPMENT
REFERENCES
VIETNAMESE-ENGLISH GLOSSARY
LIST OF FIGURES
Figure 1.1. The process of knowledge discovery in databases
Figure 1.2. The knowledge discovery process
Figure 1.3. Benefit model of data mining
Figure 1.4. Diagram of data mining with a neural network
Figure 2.5. Illustration that association rules are not decomposable
Figure 3.1. Main interface of the database
Figure 3.2. Supplier catalog
Figure 3.3. Goods catalog
Figure 3.4. Customer catalog
Figure 3.5. Invoice catalog
Figure 3.6. Invoice detail catalog
Figure 3.7. Writing XML
Figure 3.8. Main program interface
Figure 3.9. Data connection
Figure 3.10. Adding XML data
Figure 3.11. Analysis results
Figure 3.12. Minimum support filter results
Figure 3.13. Confidence filter results
LIST OF TABLES
Table 2.1. Database used to illustrate the Apriori algorithm
Table 2.2. Results of running the Apriori algorithm on database D
Table 2.3. Example of a transaction database D
Table 2.4. Frequent itemsets with Minsup = 50%
Table 2.5. Association rules generated from the frequent itemset ABE
Table 2.6. FP-tree
Table 2.7. FP-tree
Table 2.8. FP-tree
Table 2.9. FP-tree
Table 2.10. FP-tree
Table 2.11. FP-tree
Table 2.12. FP-tree
Table 2.13. FP-tree
Table 2.14. Database
LIST OF ABBREVIATIONS
Abbreviation    Meaning
KDD             Knowledge discovery in databases
DL              Data (dữ liệu)
CSDL            Database (cơ sở dữ liệu)
KPDL            Data mining (khai phá dữ liệu)
NCKPDL          Data mining context (ngữ cảnh khai phá dữ liệu)
LKH             Association rule (luật kết hợp)
INTRODUCTION
The development of information technology and its application in many areas of social and economic life over the years has meant that organisations collect and store more and more data. They store this data because they believe it holds some value. According to statistics, however, only a small fraction of it (about 5% to 10%) is ever analysed. Organisations do not know what they should or could do with the rest, yet they keep collecting it at great cost for fear that something important will be missed and needed later. Traditional database management and querying methods cannot meet this expectation, which led to the techniques of knowledge discovery and data mining (KDD - Knowledge Discovery and Data Mining).
Knowledge discovery and data mining techniques have been, and are being, studied and applied in many fields around the world. In Vietnam they are still relatively new, but they are also being studied and gradually put into practice.
There are now many ways of doing business and many software packages for managing it, for example supermarket sales management software written in Fox, C#, VB and so on. In this project, however, we do not build a complete supermarket sales management package. We study and implement only one small aspect of the "Supermarket sales management" problem: analysing data with association rules to find out how goods are related to one another. This helps managers study and analyse their sales in order to choose better goods to sell.
Within the scope of this project, we present:
- Background on data mining with association rules. This is a relatively simple form of rule, yet it is highly effective and helps uncover rare and valuable rules.
- The definitions, properties and some basic algorithms commonly used to find the association rules of a database.
- An analysis and implementation of the Apriori algorithm, applied to a small part of the "Supermarket sales management" problem.
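To make the notions behind these points concrete, the following is a minimal Python sketch of support, confidence and an Apriori-style level-wise search. It is an illustrative assumption, not the implementation built in this project: the toy baskets, the function names `support`, `confidence` and `apriori`, and the 0.5 threshold are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy baskets (not data from the thesis).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(A and B together) / support(A) for the rule A -> B."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(both, transactions) / support(a, transactions)

def apriori(transactions, min_support):
    """Return {itemset: support} for all itemsets with support >= min_support."""
    singles = {frozenset([i]) for t in transactions for i in t}
    current = {c for c in singles if support(c, transactions) >= min_support}
    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = support(c, transactions)
        # Join step: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, k))}
        current = {c for c in candidates
                   if support(c, transactions) >= min_support}
        k += 1
    return frequent

frequent = apriori(transactions, min_support=0.5)
```

With these four baskets, {bread, milk} appears in 2 of 4 transactions, so its support is 0.5, and the rule bread -> milk has confidence 0.5 / 0.75 = 2/3. The prune step is what distinguishes Apriori from brute-force enumeration: {bread, butter, milk} is discarded without counting because its subset {butter, milk} is not frequent.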
Chapter I: OVERVIEW OF DATA MINING
1.1. Problem statement
The era of the Internet, intranets and data warehouses has opened up many opportunities for businesses to collect and process information. In addition, data storage and retrieval technologies have developed rapidly, so the databases of agencies, enterprises and other organisations hold more and more rich and varied hidden information.
Within enterprise databases, transaction data plays a very important role in planning business strategy for the following years. The use of this data has achieved some results so far, but a number of problems remain:
- The analysis relies entirely on the data and uses no existing domain knowledge, so its results are hard to interpret.
- User guidance is required to decide how and where to analyse the data.
Society's conditions and demands call for methods that are fast, suitable, automatic, accurate and effective at obtaining valuable information. The knowledge extracted from such databases becomes a resource that supports managers in planning activities and in making production and business decisions. The practical use of association rule mining on transaction databases is therefore a topic of particular interest today.
The purpose of this study is to build an effective solution for applying association rules, based on a transaction database, to decision making in organisations and enterprises.
The rapid spread of information technology and Internet applications into many areas of social life, economic management, science and technology has created many huge databases, for example the sales database of a supermarket containing thousands of sales transactions, or the customer information system of a bank. To mine the information in large databases effectively in support of decision making, researchers have developed new methods, techniques and software, alongside traditional information retrieval methods, to support mining, analysing and summarising information.
There are many different data mining techniques that follow the steps of the knowledge discovery process in order to carry out data mining tasks. The following sections present these topics in turn.
1.2. Data mining and knowledge discovery
Success in any business activity today depends on using information effectively. This means finding, in the available data, hidden information that was previously undiscovered, and identifying development trends and the factors that influence them. Carrying out this work is the process of knowledge discovery in databases, and within it the techniques that let us extract the knowledge are data mining techniques.
If knowledge is viewed as the relationships among patterns between data elements, then knowledge discovery refers to the whole process of extracting knowledge from a database. It passes through several stages: understanding and identifying the problem, collecting and preprocessing data, discovering knowledge, illustrating and evaluating the discovered knowledge, and putting the results into practice.
Data mining differs semantically from knowledge discovery in databases, but in practice data mining is one stage in the chain of stages that make up knowledge discovery in databases. It is, however, the key stage, and the main one that gives knowledge discovery in databases its multidisciplinary character.
1.3. The process of knowledge discovery in databases
Knowledge discovery in databases uses many computing methods and tools, but it remains a process centred on people. It is therefore not an automatic analysis system but a system of frequent interactions between people and the database, supported of course by computing tools.
Figure 1.1. The process of knowledge discovery from databases. Stages shown: (1) problem definition; (2) data collection and preprocessing; (3) data mining and knowledge extraction; (4) presentation and evaluation of the extracted knowledge; (5) use of the discovered knowledge.
Although there are five stages as above (Figure 1.1), knowledge discovery from databases is an interactive process that repeats in a spiral, with each iteration more complete than the previous one. In addition, each stage builds on the results of the stage before it, in waterfall fashion. This is a dialectical, scientific process and constitutes the methodology of knowledge discovery. The stages are described in detail below:
1.3.1. Problem definition
This is a shaping stage whose purpose is to identify the domain in which knowledge is to be discovered and to formulate the overall problem. In practice, databases are specialized and divided by domain, such as products, business, finance, and so on. A piece of discovered knowledge may be valuable in one domain but carry little meaning in another. Identifying the domain and defining the problem therefore gives direction to the next stage, data collection and preprocessing.
1.3.2. Collection and preprocessing
The databases obtained usually contain a great many attributes but are incomplete and inhomogeneous, and contain many errors and special values. The data collection and preprocessing stage is therefore very important in the knowledge discovery process. It can be said that this stage accounts for 70% to 80% of the cost of the whole problem.
The collection and preprocessing stage is divided into the following steps: data collection, data selection, cleaning, data encoding, enrichment, and evaluation and presentation of the data. These steps are carried out in a fixed order, as follows:
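Figure 1.2. The knowledge discovery process. Steps shown: data collection (from sources such as the Internet), data selection, data cleaning, data enrichment, data encoding, evaluation and presentation. Intermediate products shown: data, target data, prepared data, cleansed and preprocessed data, transformed data, patterns (discovery), knowledge.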
1.3.2.1. Data collection
Gathering data is the first step in the data mining process. The data may be drawn from a database, a data warehouse, or even from Web application sources.
1.3.2.2. Data selection
In this stage the data is selected or partitioned according to certain criteria. The necessary data is selected and extracted from the operational database into a separate database, and this is the data used by the later stages. Gathering data into one database is often very difficult, however, because the data is scattered throughout the organization and the same kind of information may have been created in different formats. For example, one place may declare a customer attribute as a string while another declares it as a number. The quality of the data also differs from place to place. The data must therefore be selected carefully before moving on to the next stage.
1.3.2.3. Cleaning
This third stage is often neglected, but it is in fact a very important step in the data mining process. A common fault in collected data is a lack of rigor and logical consistency. As a result, the data often contains meaningless values and records that cannot be linked. This stage processes such inconsistent data, which is treated as redundant information of no value. The stage matters because data that has not been cleaned, preprocessed, and prepared in advance will lead to seriously distorted results.
This stage performs the following functions:
- Data reconciliation: this reduces the inconsistency that arises when data comes from many different sources. The usual approach is to eliminate duplicate data and unify notation. For example, one customer may have several records because the name was entered incorrectly or because some personal information changed, which creates the false impression that there are several customers.
- Handling missing values: incomplete data leads to records that contain missing values. This is quite common. Several methods are used to handle missing values: ignoring the tuples that contain them, filling them in by hand, filling them with a global constant, using the mean of the attribute over all records, using the mean of the attribute over all records of the same class, or using the most frequent value.
- Handling noise and outliers: noise in the data may be random noise or abnormal values. To clean it, one can apply smoothing methods or use outlier detection algorithms and then treat the outliers found.
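As an illustration of two of the missing-value strategies listed above (attribute mean and most frequent value), here is a minimal Python sketch. The function name `fill_missing`, the record layout, and the sample customers are assumptions made for this example; they do not come from the thesis.

```python
from collections import Counter
from statistics import mean


def fill_missing(records, attribute, strategy="mean"):
    """Return a copy of records with None values of `attribute` filled in."""
    present = [r[attribute] for r in records if r[attribute] is not None]
    if strategy == "mean":
        fill = mean(present)                         # mean over all records
    elif strategy == "mode":
        fill = Counter(present).most_common(1)[0][0]  # most frequent value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [
        {**r, attribute: fill if r[attribute] is None else r[attribute]}
        for r in records
    ]


# Hypothetical sample data, for illustration only.
customers = [
    {"name": "A", "age": 30, "city": "Hanoi"},
    {"name": "B", "age": None, "city": "Hanoi"},
    {"name": "C", "age": 40, "city": None},
]
filled_age = fill_missing(customers, "age", "mean")
filled_city = fill_missing(customers, "city", "mode")
```

The function returns new records rather than modifying the input, so the raw data stays available for comparing strategies.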
1.3.2.4. Data enrichment
The purpose of this stage is to add further relevant information to the original database. To do this we need other, external databases related to the original one. We then add the necessary information, which increases the chances of discovering knowledge.
This is the step in data mining that requires thought. At this stage many different algorithms have been used to extract patterns from the data. The commonly used ones are classification rules, association rules, and sequential data models.
Enrichment includes integrating and transforming data. Data from many different sources is integrated into a single store. Differing data formats are also converted and recomputed into one uniform type, which makes analysis easier.
1.3.2.5. Data encoding
Next comes the data transformation stage, in which the data is reorganized so that it can be used and manipulated. The data is converted to suit the purpose of the mining. The aim of this stage is to convert data types into forms that are convenient for running the discovery algorithms. There are many ways to encode data, for example:
- Partitioning: the data is a string value drawn from a fixed set of strings.
- Converting a year value into an integer giving the number of years elapsed relative to the current year.
- Dividing numeric values by a factor so that the values fall within a smaller range.
- Converting Yes/No into 0/1.
1.3.2.6. Evaluation and presentation
This is the last stage of the data mining process. In this stage, data patterns are extracted by the data mining software. Not every pattern is useful, and some may be misleading. Evaluation criteria must therefore be given priority so that only the knowledge that is actually needed is extracted.
These are the six stages of the data mining process.
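The encoding rules of Section 1.3.2.5 can be sketched in a few lines of Python. This is only an illustration under assumed names: the fields `birth_year`, `income`, and `has_car`, the scaling factor, and the sample record are hypothetical and not taken from the thesis.

```python
def encode_record(record, current_year, income_scale=1000):
    """Encode one raw record using the rules of Section 1.3.2.5."""
    return {
        # Year value -> number of years elapsed relative to the current year.
        "years_elapsed": current_year - record["birth_year"],
        # Divide a numeric value by a factor so it falls in a smaller range.
        "income_scaled": record["income"] / income_scale,
        # Yes/No -> 1/0.
        "has_car": 1 if record["has_car"].strip().lower() == "yes" else 0,
    }


# Hypothetical raw record, for illustration only.
raw = {"birth_year": 1985, "income": 42000, "has_car": "Yes"}
encoded = encode_record(raw, current_year=2019)
```

Keeping each rule as a separate dictionary entry makes it easy to add or remove an encoding without touching the others.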
1.3.3. Data mining
The data mining stage begins once the data has been collected and processed. The main work in this stage is to define the data mining problem, choose mining methods suited to the available data, and extract the required knowledge.
This is the essential stage, in which intelligent methods are applied to extract data patterns.
1.3.4. Presentation and evaluation of results
The knowledge discovered from the database needs to be summarized in reports that serve various decision-support purposes.
Because several mining methods may be applied, the results vary in quality. Evaluating the results obtained is therefore necessary and provides a basis for strategic decisions. The results are usually summarized and compared using charts, then validated and computerized.
1.3.5. Using the discovered knowledge
The discovered knowledge is consolidated and refined, combined into a system, and any potential conflicts within it are resolved. After that the knowledge is ready to be applied.
The results of the knowledge discovery process can be applied in many different domains. Because the results may be predictions or descriptions, they can be fed into decision support systems in order to automate this process.
1.4. Benefits of data mining
- Providing knowledge that supports decision making.
- Forecasting.
- Generalizing data.
Figure 1.3 is a model showing the benefits of data mining in analysis and decision making for marketing a particular product.
Figure 1.3. Model of the benefits of data mining. Labels shown: marketing, marketing database, KDD & Data Mining, data warehouse.
1.5. Data mining techniques
Data mining techniques are usually divided into two main groups:
1.5.1. Descriptive data mining techniques
These describe the general properties or characteristics of the data in the existing database. They include clustering, summarization, visualization, evolution and deviation analysis, and association rule analysis.
1.5.2. Predictive data mining techniques
These make predictions based on inferences from the current data. They include classification and regression.
1.6. Main tasks of data mining
The purpose of data mining is clearly that the extracted knowledge will be used for competitive advantage in the marketplace and for the benefit of scientific research.
We can therefore regard the main goals of data mining as description and prediction. The patterns that data mining discovers serve these goals.
Prediction involves using variables or fields in the database to extract patterns that predict unknown or future values of the variables of interest.
Description concentrates on finding patterns that describe the data in a form people can understand.
To achieve these two goals, the main tasks of data mining are:
- Classification.
- Regression.
- Clustering.
- Summarization.
- Dependency modeling.
- Change and deviation detection.
1.6.1. Classification
Classification assigns a data sample to one of a set of predefined classes.
The goal of a classification algorithm is to find relationships between the predictor attributes and the class attribute, and then to use those relationships to predict the class of new tuples that have the same format.
1.6.2. Regression
Regression learns a function that maps a data sample to a real-valued prediction variable. There are many data mining applications with a regression task, for example making estimates from remotely sensed microwave measurements, assessing a patient's probability of death given the results of diagnostic tests, or predicting consumer demand for a new product as a function of advertising expenditure.
1.6.3. Clustering
Clustering is a general descriptive task that seeks a finite set of groups or categories describing the data. The groups may be mutually exclusive, hierarchical, or overlapping, which means that a data item may belong to more than one group. Data mining applications with a clustering task include discovering sets of customers with similar behavior in a marketing database and identifying categories of spectra from infrared measurements.
1.6.4. Summarization
The summarization task produces characteristic descriptions of a class. Such a description is a kind of summary of the common properties of all the basket-type tuples that belong to a class.
Characteristic descriptions are expressed as rules, usually of the form: "If a tuple belongs to the class given in the premise, then the tuple has all the attributes stated in the conclusion." These rules differ from classification rules: a characteristic rule for a class is generated only from tuples that already belong to that class.
1.6.5. Dependency modeling
This consists of finding a model that describes significant dependencies between variables. Dependency models exist at two levels: the structural level of the model specifies which variables are locally dependent on one another, and the quantitative level specifies the strength of the dependencies on some scale.
1.6.6. Change and deviation detection
This focuses on discovering the most significant changes in the data relative to normative or previously measured values.
Because these tasks require very different amounts and kinds of information, they usually lead to different choices in the design and selection of data mining algorithms. For example, a decision tree algorithm produces a description that discriminates samples between classes but does not give the properties and characteristics of each class.
1.7. Data mining methods
Data mining is a process of pattern discovery in which the mining algorithm searches for interesting patterns of a specified form, such as rules, classification trees, regression models, clusters, and so on.
1.7.1. Components of a data mining algorithm
A data mining algorithm has three main components: model representation, model evaluation, and model search.
Model representation: the model is expressed in a language L used to describe the patterns that can be discovered. The data analyst must fully understand the representational assumptions and must be able to state which assumptions are made by the algorithm. The model is assessed by feeding test data into it and adjusting the parameters where necessary.
Model evaluation: this determines whether a pattern meets the criteria of the knowledge discovery process. Predictive accuracy is evaluated by cross validation. Evaluation of descriptive quality involves the predictive accuracy, novelty, usability, and understandability of the model. Both statistical and logical criteria can be used to evaluate a model.
Search method: the search method has two components, parameter search and model search.
- Parameter search: optimizes the model evaluation criteria given the observed data and a fixed model description.
- Model search: runs as a loop over the parameter search method. The model description is changed, giving rise to a family of models.
For each model description, the parameter search method is applied to evaluate the quality of the model. Model search methods usually rely on heuristic search techniques, because the size of the space of models generally rules out exhaustive search, and simple solutions are not easy to obtain.
For example, only a few of the people who buy English books also buy a CD. The number of association rules in some large databases is almost unlimited, so an algorithm cannot discover all of them, nor can it tell which rules carry genuinely valuable and interesting information.
This raises the question of which association rules are really valuable. Suppose we have the rule: music, foreign language, sports => CD, meaning that people who buy music, foreign-language, and sports books also buy a CD. We are then interested in the number of customers in the database who satisfy this rule, that is, the support of the rule. The support of the rule is the percentage of records that contain music, foreign-language, and sports books together with a CD.
Support alone is not enough, however. There may be a reasonably large group of people who read all three kinds of book, but an even larger group who like sports, music, and foreign-language books and do not buy CDs. In that case the association is very weak even though the support is fairly high. We therefore need a second measure, confidence. Confidence is the percentage of records containing a CD among the records that contain music, sports, and foreign-language books.
The task of association rule discovery is to find all rules of the form X => B such that the frequency of the rule is not less than a given threshold Minsup and the confidence of the rule is not less than a given threshold Minconf. From one database we may find thousands or even hundreds of thousands of association rules.
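The book and CD example can be made concrete with a small Python sketch. The ten baskets below are invented purely for illustration (they are not data from the thesis), and the helper names `support` and `confidence` are this example's own. Exact fractions are used so the two measures can be read off without rounding.

```python
from fractions import Fraction


def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in transactions), len(transactions))


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent together) over support of antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


books = {"music", "language", "sports"}
# Hypothetical baskets: 2 with all three book types and a CD,
# 6 with all three book types but no CD, 2 unrelated baskets.
transactions = [books | {"CD"}] * 2 + [books] * 6 + [{"language"}] * 2

rule_support = support(transactions, books | {"CD"})    # 2 of 10 baskets
rule_confidence = confidence(transactions, books, {"CD"})  # 2 of the 8 "book" baskets
```

In this made-up data the rule is supported by a fifth of all baskets, yet only a quarter of the customers who buy the three kinds of book also buy a CD, which is the situation the text describes: support by itself does not show how strong the association is.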
1.7.2.4. Neural networks
Neural networks are a newer computational approach concerned with developing mathematical structures that have the ability to learn. The methods are the result of research into models of learning in the human nervous system.
Neural networks can derive meaning from complex or imprecise data and can be used to extract patterns and detect trends that are too complex for either people or other computing techniques to notice. Neural networks are frequently mentioned in discussions of data mining. Although they have some limitations that make them hard to apply and develop, they also have considerable advantages.
Figure 1.4. Data mining with a neural network. Labels shown: data, neural network model, extracted patterns.
One notable advantage of neural networks is their ability to produce highly accurate predictive models that can be applied to many different kinds of problem and that meet the tasks set by data mining, such as classification, clustering, modeling, and forecasting time-dependent events.
1.7.2.5. Genetic algorithms
Broadly speaking, a genetic algorithm simulates the system of evolution in nature. More precisely, it is an algorithm that specifies how a population of individuals is formed, evaluated, and modified, for example how to choose which individuals reproduce and which are eliminated. The algorithm also simulates on a computer the role of genes in biological chromosomes, so that it can solve a variety of practical problems.
A genetic algorithm is an optimization algorithm. It is widely used to optimize data mining techniques, including neural networks. Its link to data mining can be seen, for example, in decision tree and rule induction techniques. As mentioned in an earlier section, the rules that model the data contain parameters that are determined by the knowledge discovery algorithms.
An optimization stage is needed to determine which parameter values produce the best rules, and this is why genetic algorithms are used in data mining tools.
1.8. Applications of data mining
Data mining is a field related to many other disciplines, such as database systems, statistics, and visualization. Depending on the approach used, data mining may also apply techniques such as neural networks, rough set theory, fuzzy sets, and knowledge representation. Compared with these methods, data mining has several clear advantages.
Compared with machine learning, data mining has the advantage that it can be used with databases that contain a lot of noise, incomplete data, or continuously changing data, whereas machine learning is mainly applied to databases that are complete, change little, and are not too large.
Expert systems: this method differs from data mining in that the experts' examples are usually of much higher quality than the data in a database, and they usually cover only the important cases. Moreover, the experts confirm the value and usefulness of the patterns discovered.
Statistics is one of the theoretical foundations of data mining, but a comparison of the two shows that statistical methods still have some weaknesses that data mining has overcome:
- Standard statistical methods are not suited to the structured data types found in many databases.
- Statistical methods are entirely data driven and make no use of existing domain knowledge.
- The results of the analysis can be very numerous and hard to interpret.
- Statistical methods need guidance from the user to determine how and where to analyze the data.
With these advantages, data mining is now widely applied in many areas of business and daily life, such as marketing, finance, banking and insurance, science, health care, security, and the Internet. Many large organizations and companies around the world have applied data mining techniques to their production and business activities and gained great benefits.
Some applications of data mining in business:
Brandaid: a flexible marketing model focused on consumer goods.
Callpla: helps sales staff determine the number of visits to make to prospective and existing customers.
Detailer: determines which customers should be visited and which products should be presented on each visit.
Geoline: a model for designing sales and service territories.
Mediac: helps advertisers buy media for a year and plan media use, including outlining market segments and estimating potential.
1.9. Some challenges for data mining
Large databases.
Changes in data and knowledge may make the discovered patterns no longer valid.
Missing or noisy data.
Chapter II: FREQUENT ITEMSETS AND ASSOCIATION RULES
2.1. Introduction
Companies and businesses today store large amounts of sales information. A record in such a database contains information such as the date of sale and the quantity sold. From a sales database we can find relationships between attribute-value pairs. A typical association rule is, for example: 80% of customers who buy foreign-language books also buy a CD or VCD.
2.2. Basic concepts
2.2.1. Definition 2.2.1: Data mining context
Let $O$ be a finite non-empty set of transactions and $I$ a finite non-empty set of items, and let $R$ be a binary relation between $O$ and $I$ such that, for $o \in O$ and $i \in I$, $(o, i) \in R$ if and only if transaction $o$ contains item $i$. The data mining context (abbreviated NCKPDL below) is the triple $(O, I, R)$.
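2.2.2. Definition 2.2.2: Galois connections
Given a data mining context (NCKPDL) $(O, I, R)$, consider the two maps $\rho$ and $\lambda$ defined as follows:
$\rho : P(I) \to P(O)$ and $\lambda : P(O) \to P(I)$:
For $S \subseteq I$, $\rho(S) = \{\, o \in O \mid \forall i \in S,\ (o, i) \in R \,\}$
For $X \subseteq O$, $\lambda(X) = \{\, i \in I \mid \forall o \in X,\ (o, i) \in R \,\}$
Here $P(X)$ denotes the set of all subsets of $X$.
The pair of functions $(\rho, \lambda)$ is called a Galois connection. The value $\rho(S)$ is the set of transactions that contain all the items in $S$. The value $\lambda(X)$ is the set of items present in every transaction of $X$.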
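The two maps of the Galois connection translate directly into set comprehensions. The following Python sketch uses the names `rho` and `lam` for them and a tiny relation `R` invented for the example; none of this data comes from the thesis.

```python
def rho(S, O, R):
    """Transactions of O that contain every item of S."""
    return {o for o in O if all((o, i) in R for i in S)}


def lam(X, I, R):
    """Items of I that occur in every transaction of X."""
    return {i for i in I if all((o, i) in R for o in X)}


# Hypothetical context: three transactions over three items.
O = {1, 2, 3}
I = {"a", "b", "c"}
R = {(1, "a"), (1, "b"), (2, "a"), (2, "b"), (2, "c"), (3, "b")}

common_transactions = rho({"a", "b"}, O, R)  # transactions holding both a and b
common_items = lam({1, 2}, I, R)             # items shared by transactions 1 and 2
```

Composing the two maps, `lam(rho(S, O, R), I, R)`, always returns a set that contains `S`; this closure behaviour is what makes the pair useful for reasoning about itemsets.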
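2.2.3. Definition 2.2.3: Support
2.2.3.1. The support of an itemset $X$ in a database $D$ is the ratio of the number of transactions $T \in D$ that contain $X$ to the total number of transactions in $D$ (that is, the percentage of transactions in $D$ that contain $X$), written $Supp(X)$.
$$Supp(X) = \frac{|\{\, T \in D : X \subseteq T \,\}|}{|D|}$$
We have $0 \le Supp(X) \le 1$ for every itemset $X$.
In other words, support indicates how frequently a pattern occurs.
2.2.3.2. The support of a rule $X \to Y$ is the ratio of the number of transactions that contain $X \cup Y$ to the number of transactions in the database $D$, written $Supp(X \to Y)$.
$$Supp(X \to Y) = \frac{|\{\, T \in D : X \cup Y \subseteq T \,\}|}{|D|}$$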
Thus a rule with support 50% means that 50% of the transactions contain the itemset $X \cup Y$. Support expresses the statistical significance of an association rule.
2.2.4. Definition 2.2.4: Confidence
2.2.4.1. Property 2.2.4.1: Support of a subset.
Let $A, B \subseteq I$ be itemsets with $A \subseteq B$. Then $Supp(A) \ge Supp(B)$.
This follows directly from the definition of support, because every transaction that supports $B$ also supports $A$: any transaction containing the itemset $B$ also contains the itemset $A$.
2.2.4.2. Property 2.2.4.2
Let $A$ and $B$ be two itemsets, $A, B \subseteq I$. If $B$ is a frequent itemset and $A \subseteq B$, then $A$ is also a frequent itemset.
Indeed, if $B$ is frequent then $Supp(B) \ge Minsup$, and every itemset $A$ that is a subset of $B$ is frequent in the database $D$ because $Supp(A) \ge Supp(B)$ (by Property 2.2.4.1).
2.2.4.3. Property 2.2.4.3
Let $A$ and $B$ be two itemsets with $A \subseteq B$. If $A$ is not frequent, then $B$ is not frequent either.
Indeed, $A$ is not frequent, so $Supp(A) < Minsup$, and since $A \subseteq B$ we have $Supp(A) \ge Supp(B)$.
Hence $Supp(B) < Minsup$, so $B$ is not a frequent itemset.
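2.2.4.4. Property 2.2.4.4
Let $X, Y, Z \subseteq I$ be itemsets such that $X \cap Y = \emptyset$. Then:
$$Conf(X \to Y) \ge Conf(X \setminus Z \to Y \cup Z).$$
Indeed, from $X \cup Y \subseteq X \cup Y \cup Z$ and $X \setminus Z \subseteq X$ we have:
$$\frac{Supp(X \cup Y \cup Z)}{Supp(X \setminus Z)} \le \frac{Supp(X \cup Y)}{Supp(X)}$$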
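The confidence of a rule $r = X \to Y$ is the ratio (percentage) of the number of transactions in $D$ that contain $X \cup Y$ to the number of transactions in $D$ that contain the itemset $X$. The confidence of a rule is written $Conf(r)$, and $0 \le Conf(r) \le 1$.
Remark: support and confidence are the following probabilities, where $P(X \cup Y)$ denotes the probability that a transaction contains every item of $X \cup Y$:
$Supp(X \to Y) = P(X \cup Y)$.
$Conf(X \to Y) = P(Y \mid X) = Supp(X \cup Y) / Supp(X)$.
A rule with confidence 85% means that 85% of the transactions that contain $X$ also contain $Y$. The confidence of a rule expresses the degree of correlation in the data between the two sets $X$ and $Y$, and measures how reliable the rule is.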
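Property 2.2.4.1 and the conditional-probability reading of confidence can be checked mechanically on a small database. The sketch below is only an illustration: the four transactions in `D` and the helper names `supp` and `conf` are assumptions made for this example.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical transaction database.
D = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]


def supp(X):
    """Supp(X): share of transactions in D that contain the itemset X."""
    return Fraction(sum(set(X) <= t for t in D), len(D))


def conf(X, Y):
    """Conf(X -> Y) = Supp(X u Y) / Supp(X)."""
    return supp(set(X) | set(Y)) / supp(X)


items = sorted(set().union(*D))
itemsets = [set(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

# Property 2.2.4.1: whenever A is a subset of B, Supp(A) >= Supp(B).
subset_property_holds = all(
    supp(A) >= supp(B) for A in itemsets for B in itemsets if A <= B
)

rule_conf = conf({"a"}, {"b"})  # Supp({a, b}) / Supp({a})
```

Here three of the four transactions contain `a` and two of those also contain `b`, so the confidence of the rule from `a` to `b` is two thirds, exactly the conditional frequency of `b` among the transactions containing `a`.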
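2.2.5. Definition 2.2.5: Frequent itemsets
Let $(O, I, R)$ be a data mining context (NCKPDL) and let $Minsup \in (0, 1]$ be the minimum support threshold. For $S \subseteq I$, the support of $S$, written $SP(S)$, is the ratio of the number of transactions containing $S$ to the number of transactions in $O$. In other words, $SP(S) = |\rho(S)| / |O|$, where $\rho(S)$ is the set of transactions in $O$ that contain every item of $S$.
For $S \subseteq I$, $S$ is a frequent itemset with respect to the threshold $Minsup$ if and only if $SP(S) \ge Minsup$. We write $FS(O, I, R, Minsup) = \{\, S \in P(I) \mid SP(S) \ge Minsup \,\}$.
2.2.6. Definition 2.2.6: Association rules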
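Let $(O, I, R)$ be a data mining context (NCKPDL) and $Minsup \in (0, 1]$ a threshold. For $S \in FS(O, I, R, Minsup)$, let $X$ and $Y$ be non-empty subsets of $S$ such that $S = X \cup Y$ and $X \cap Y = \emptyset$. The association rule between $X$ and $Y$ has the form $X \to Y$ and reflects the likelihood that a customer buys the items in $Y$ when buying the items in $X$. The support of the association rule $X \to Y$ with $S = X \cup Y$ is $SP(S)$.
The confidence of the association rule $X \to Y$ is written $CF(X \to Y)$ and is computed as $CF(X \to Y) = SP(X \cup Y) / SP(X)$.
The Apriori principle.
Let $S \in FS(O, I, R, Minsup)$. If $T \subseteq S$ then $T \in FS(O, I, R, Minsup)$.
Let $T \notin FS(O, I, R, Minsup)$. If $T \subseteq S$ then $S \notin FS(O, I, R, Minsup)$.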
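2.2.6.1. Property 2.2.6.1: Association rules are not composable.
If $X \to Z$ and $Y \to Z$ hold on $D$, it does not necessarily follow that $X \cup Y \to Z$ holds.
Indeed, consider the case where $X \cap Y = \emptyset$ and the transactions in $D$ support $Z$ if and only if they support $X$ or support $Y$, but never both. Then $Supp(X \cup Y) = 0$, so the rule $X \cup Y \to Z$ has support 0.
Similarly, from $X \to Y$ and $X \to Z$ it does not necessarily follow that $X \to Y \cup Z$.
2.2.6.2. Property 2.2.6.2: Association rules are not decomposable.
If $X \cup Y \to Z$ holds, then $X \to Z$ and $Y \to Z$ do not necessarily hold.
For example, consider the case where $Z$ is present in a transaction only when $X$ and $Y$ are both present, that is, $Supp(X \cup Y) = Supp(Z)$. If the supports of $X$ and $Y$ are larger than $Supp(X \cup Y)$, that is, $Supp(X) > Supp(X \cup Y)$ and $Supp(Y) > Supp(X \cup Y)$, then the two separate rules will not have the required confidence.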
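In the converse case, however, from $X \to Y \cup Z$ we can deduce $X \to Y$ and $X \to Z$.
To explain this property, consider the following example:
Figure 2.5. Illustration that association rules are not decomposable
Suppose $Z$ appears in a transaction only if both $X$ and $Y$ appear in that transaction, that is, $Supp(X \cup Y) = Supp(Z)$. If the supports of $X$ and $Y$ are larger than $Supp(X \cup Y)$, the two rules above will not reach the required confidence. But if $X \to Y \cup Z$ holds on $D$, then $X \to Y$ and $X \to Z$ also hold on $D$, because $Supp(X \cup Y) \ge Supp(X \cup Y \cup Z)$ and $Supp(X \cup Z) \ge Supp(X \cup Y \cup Z)$.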
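2.2.6.3. Property 2.2.6.3: Association rules are not transitive.
If $X \to Y$ and $Y \to Z$ hold on $D$, we cannot conclude that $X \to Z$ holds on $D$.
Let $T(W)$ denote the set of transactions that contain the itemset $W$, and suppose $T(Z) \subset T(Y) \subset T(X)$ and $Conf(X \to Y) = Conf(Y \to Z) = Minconf$.
Then $Conf(X \to Z) = Minconf^2 < Minconf$ (for $0 < Minconf < 1$), so the rule $X \to Z$ does not reach the minimum confidence; that is, $X \to Z$ does not hold on $D$.
2.2.6.4. Property 2.2.6.4
If the rule $X \to (L - X)$ does not satisfy the minimum confidence, then the rule $Y \to (L - Y)$ does not satisfy it either, for all itemsets $Y \subseteq X \subseteq L$.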
2.3. Finding frequent itemsets
2.3.1. Some concepts
Given a data mining context (NCKPDL) $(O, I, R)$ and $Minsup \in (0, 1]$, find $FS(O, I, R, Minsup)$. The algorithm is built on the Apriori principle. It first finds the frequent itemsets with one element. Candidates for the frequent itemsets with two elements are then formed by taking unions of the frequent one-element itemsets. In general, the candidate itemsets with $k$ elements are generated from the frequent itemsets with $k-1$ elements. Let $F_k = \{\, S \in P(I) \mid SP(S) \ge Minsup \text{ and } |S| = k \,\}$. The algorithm examines each candidate in turn to build $F_k$, which consists of the candidates whose support is greater than or equal to the threshold $Minsup$.
- Set of items (Itemset) $I = \{i_1, i_2, \ldots, i_m\}$:
Example: I = {milk, bread, cereal, yogurt}
A set of $k$ items is called a k-Itemset.
- Transaction $t$: a set of items such that $t \subseteq I$
Example: t = {bread, yogurt, cereal}
- Transaction database $D = \{t_1, t_2, \ldots, t_n\}$, where $t_i = \{i_{i1}, i_{i2}, \ldots, i_{ik}\}$ with $i_{ij} \in I$
- A transaction $t$ contains $X$ if $X$ is a set of items in $I$ and $X \subseteq t$
Example: X = {bread, yogurt}
- The support (supp) of an itemset $X$ in the database $D$ is the ratio of the number of transactions containing $X$ to the total number of transactions in $D$.
$Supp(X) = count(X) / |D|$
- A frequent itemset $S$ is an itemset whose support meets the minimum support:
if $Supp(S) \ge Minsup$ then $S$ is a frequent itemset.
- Property of frequent itemsets (Apriori): all subsets of a frequent itemset are frequent itemsets.
2.3.2. The Apriori algorithm
2.3.2.1. Description of the algorithm
First, scan the database to find the individual items it contains and their supports. The resulting set is $C_1$. Scan $C_1$ and remove the items with support < Minsup; the remaining itemsets of $C_1$ are the frequent 1-Itemsets ($L_1$). Next, join $L_1$ with $L_1$ to obtain the set of candidate 2-Itemsets $C_2$. Scan the database to determine the support of the itemsets in $C_2$. Scan $C_2$ and remove the itemsets with support < Minsup; the remaining itemsets of $C_2$ form the set of frequent 2-Itemsets ($L_2$). $L_2$ is in turn used to generate $L_3$, and so on, until a level $k$ is reached at which $L_k = \emptyset$ (no further frequent itemset is found), where the algorithm stops.
The set of all frequent itemsets of the database is $\bigcup_{i=1}^{k-1} L_i$.
To make candidate generation more efficient, we use a property of frequent itemsets to reduce the number of candidate itemsets generated that are not frequent. The property is: every non-empty subset of a frequent itemset is a frequent itemset.
Join step:
Input: the set $L_{k-1}$ of frequent (k-1)-Itemsets.
Output: the set $C_k$ of candidates for the frequent k-Itemsets.
The set of candidate k-Itemsets is generated by joining $L_{k-1}$ with itself. Let $l_1, l_2$ be itemsets of $L_{k-1}$, and let $l_j[i]$ denote the $i$-th item of the itemset $l_j$. The join of $L_{k-1}$ with $L_{k-1}$ is carried out as follows: itemsets of $L_{k-1}$ are joined with each other if their leading items coincide and $l_1[k-1]$ ...
2.3.2.2. Example illustrating the Apriori algorithm
Consider the transaction database D given in the following table:
Table 2.1. Database used to illustrate the Apriori algorithm
TID    List of items
1      I1, I2, I5
2      I2, I4
3      I2, I3
4      I1, I2, I4
5      I1, I3
6      I2, I3
7      I1, I3
8      I1, I2, I3, I5
9      I1, I2, I3
In the first iteration of the algorithm, every item is a member of the candidate set $C_1$. The algorithm scans all the transactions of D to count the number of occurrences of each item.
Suppose the minimum support count is 2 transactions (that is, Minsup = 2/9 × 100% ≈ 22%). The set of frequent 1-Itemsets ($L_1$) is then determined as follows: $L_1$ consists of all candidate 1-Itemsets that satisfy the minimum support.
To find the frequent 2-Itemsets ($L_2$), the algorithm joins $L_1$ with $L_1$ to generate the candidate 2-Itemsets ($C_2$). $C_2$ consists of the 2-element combinations of the items in $L_1$, so the number of elements of $C_2$ is:
$$|C_2| = \binom{|L_1|}{2} = \binom{5}{2} = \frac{5!}{2!\,3!} = 10$$
Next, scan the transactions in D and compute the support of the candidate itemsets in $C_2$.
The set of frequent 2-Itemsets $L_2$ is then determined; it consists of the candidate 2-Itemsets in $C_2$ whose support is greater than or equal to the minimum support Minsup.
Generating the candidate 3-Itemsets $C_3$ by joining $L_2$ with itself gives:
C3 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}
We use the Apriori property to prune candidates: all subsets of a frequent itemset are frequent. Four candidates in $C_3$ therefore cannot be frequent because they contain itemsets that are not frequent, and we prune (remove) these four candidates from $C_3$. In detail:
+ The sets {I1, I3, I5} and {I2, I3, I5} are not frequent because their subset {I3, I5} is not frequent (it is not in $L_2$).
+ The set {I2, I3, I4} is not frequent because its subset {I3, I4} is not frequent (it is not in $L_2$).
+ The set {I2, I4, I5} is not frequent because its subset {I4, I5} is not frequent (it is not in $L_2$).
Pruning these candidates reduces the work of scanning the database to compute supports when determining $L_3$. Note that for a candidate k-Itemset we only need to check whether its (k-1)-Itemset subsets are frequent, because the Apriori algorithm uses a breadth-first search strategy.
After the join and prune steps, the resulting set $C_3$ is:
C3 = {{I1, I2, I3}, {I1, I2, I5}}
Scan the transactions in the database to determine $L_3$, which consists of the candidate 3-Itemsets in $C_3$ that satisfy the minimum support. We obtain:
L3 = {{I1, I2, I3}, {I1, I2, I5}}
Generating the candidate 4-Itemsets $C_4$ by joining $L_3$ with itself gives the itemset {I1, I2, I3, I5}. In the prune step this set is removed because it contains the subset {I2, I3, I5}, which is not frequent (it is not in $L_3$).
Thus $C_4 = \emptyset$ and the algorithm terminates. All the frequent itemsets have now been found.
The frequent itemsets found in the transaction database D with minimum support Minsup = 22% (corresponding to a minimum support count of 2 transactions) are as follows.
Table 2.2. Result of running the Apriori algorithm on database D
Type of frequent itemset    Frequent itemsets
1-Itemset                   {I1} {I2} {I3} {I4} {I5}
2-Itemset                   {I1, I2} {I1, I3} {I1, I5} {I2, I3} {I2, I4} {I2, I5}
3-Itemset                   {I1, I2, I3} {I1, I2, I5}
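The worked example can be replayed with a compact level-wise implementation. The Python sketch below is not the procedure from the thesis: for brevity its join step takes unions of pairs of frequent (k-1)-itemsets rather than the prefix-based join, which yields the same candidates after pruning. The function name `apriori` and the use of an absolute count threshold `min_count` are choices made for this example.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset: count} for every itemset occurring >= min_count times."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: n for c, n in count(items).items() if n >= min_count}
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c: n for c, n in count(candidates).items() if n >= min_count}
    return frequent


# Transactions of Table 2.1.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
frequent = apriori(D, min_count=2)
sizes = {}
for itemset in frequent:
    sizes[len(itemset)] = sizes.get(len(itemset), 0) + 1
```

With a minimum count of 2, the run finds five 1-itemsets, six 2-itemsets, and two 3-itemsets, in agreement with Table 2.2.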
2.3.2.3. Procedure code
Input: database D, Minsup
Output: L, the frequent itemsets in D
Ck: the set of candidates of size k
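2.4. Finding association rules
Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of $m$ distinct attributes, each of which is called an item. Let $D$ be a database in which each record $T$ is a transaction containing a set of items, $T \subseteq I$.
Definition 2.4.1: An association rule is a relation of the form $X \to Y$, where $X, Y \subset I$ are sets of items called itemsets and $X \cap Y = \emptyset$. Here $X$ is called the antecedent and $Y$ the consequent.
The most important parameters of an association rule are its support (s) and its confidence (c).
Definition 2.4.2: The support of the association rule $X \to Y$ is the percentage of records that contain $X \cup Y$ relative to the total number of transactions in the database.
Definition 2.4.3: For a given set of transactions, the confidence is the ratio of the number of transactions that contain $X \cup Y$ to the number of transactions that contain $X$, expressed as a percentage.
Mining association rules from a database means finding all rules whose support and confidence are at least the thresholds specified in advance by the user. The support and confidence thresholds are written Minsup and Minconf.
Association rule mining is decomposed into the following two subproblems:
- Find all frequently occurring itemsets whose support is greater than or equal to Minsup.
- Generate the desired rules from these large itemsets, keeping those whose confidence is greater than or equal to Minconf.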
2.4.1. Statement of the association rule mining problem
$I = \{i_1, i_2, \ldots, i_n\}$ is a set of $n$ items (also called attributes). $X \subseteq I$ is called an itemset.
$T = \{t_1, t_2, \ldots, t_m\}$ is a set of $m$ transactions (also called records), each identified by a TID (Transaction Identification).
$R$ is a binary relation on $I$ and $T$. If transaction $t$ contains item $i$, we write $(i, t) \in R$. $(T, I, R)$ is the data mining context. Formally, a database $D$ is just such a binary relation $R$.
In terms of meaning, a database is a set of transactions, and each transaction $t$ is an itemset, $t \in 2^I$ (where $2^I$ is the set of all subsets of $I$).
Example of a transaction database: I = {A, B, C, D, E}, T = {1, 2, 3, 4, 5, 6}
The transactions are given in the following table:
The threshold Minconf (Minimum Confidence) reflects how often $Y$ occurs given $X$ (a rule must satisfy $c \ge Minconf$).
The association rules sought are those that satisfy the given Minsup and Minconf: we are interested only in rules whose support is at least the minimum support and whose confidence is at least the minimum confidence.
Most association rule mining algorithms are divided into two phases:
Phase 1: find all frequent itemsets in the database, that is, all itemsets $X$ satisfying $s(X) \ge Minsup$.
Phase 2: generate confident rules from the frequent itemsets found in phase 1.
If $X$ is a frequent itemset, an association rule generated from $X$ has the form:
$X' \xrightarrow{c} X \setminus X'$, where:
$X'$ is a non-empty subset of $X$,
$X \setminus X'$ is the set difference of $X$ and $X'$,
$c$ is the confidence of the rule, which must satisfy $c \ge Minconf$.
With the frequent itemsets in table 4, we can generate the following association rules:
Table 2.5. Association rules generated from the frequent itemset ABE
Association rule    Confidence c    c ≥ Minconf?
A → BE              100%            Yes
B → AE              67%             No
E → AB              80%             Yes
AB → E              100%            Yes
AE → B              100%            Yes
BE → A              80%             Yes
Maximal frequent itemset: let $M \in FX(T, I, R, Minsup)$. $M$ is called a maximal frequent itemset if there is no $X \in FX(T, I, R, Minsup)$ with $M \subset X$ and $M \ne X$.
2.4.2. Developing an efficient solution for association rule mining
The association rule problem.
Given a set of items $I$, a transaction database $D$, a minimum support threshold Minsup, and a confidence threshold Minconf, find the association rules of the form $X \to Y$ on $D$ that satisfy Support($X \to Y$) >= Minsup and Confidence($X \to Y$) >= Minconf.
The association rule mining procedure.
Determining the large itemsets. This consists of the following two main steps:
Determine the candidate sets ($C_k$).
Xc nh cc tp mc ln (L) da vo tp ng c vin
xc nh tp ng c vin, ta thc hin cc bc sau y:
1. Tm cc tp ng c vin mt mc.
2. Qut CSDL D xc nh h tr ca cc tp ng c vin. Trong vngu tin, cc tp ng c vin cng chnh l tt c cc mc c trong CSDL. Ti vngth k (k>1), cc tp ng c vin c xc nh da vo cc tp mc ln xc nhti vng k 1, s dng hm Apriori-Gen () Sau khi xc nh c cc tp ng cvin, thut ton qut tng giao dch trong CSDL tnh h tr ca cc tp ngc vin. Qu trnh xc nh cc tp mc s kt thc khi khng xc nh c thmtp mc ln no na.
3. Ni dung hm Apriori-gen (). Hm Apriori-gen () thc hin hai bc.
1.Bc u tin, Lk 1 c kt ni vi chnh n thu c Ck.2.Bc th hai, Apriori_gen () xo tt c cc tp mc t kt qu kt ni mc mt s tp con (k 1) khng c trong L k 1. Sau n tr v tp mc ln kchthc k cn li.
3. Sinh cc lut kt hp t tp mc ln.
Vic pht hin cc tp mc ln l rt tn km v mt tnh ton. Tuy nhin,
ngay khi tm c tt c cc tp mc ln (l L), ta c th d dng sinh ra cc lutkt hp c th c bng cc bc nh sau:
1. Tm tt c cc tp con khng rng x, ca tp mc ln l L.2. Vi mi tp con x tm c, ta xut ra lut dng x (l - x) nu t l
Support (l)/Support (x) >= Mincof (%).
3. Th tc sinh ra cc tp con.
4. u vo:
5. Tp mc ln Lk
u ra:
Tp lut tho mn iu kin tin cy >= Mincof v h tr >= MinsupPhng php:
Forall Lk, k >= 2 do
  Call Genrules(Lk, Lk);
Procedure Genrules(Lk: Large k-Itemset, am: Large m-Itemset)
  A = { (m-1)-Itemset am-1 | am-1 ⊂ am }
  Forall am-1 ∈ A do begin
    Conf = Support(Lk) / Support(am-1)
    If (Conf >= Minconf) then begin
      Output the rule am-1 → (Lk - am-1)
        with Confidence = Conf and Support = Support(Lk)
      If (m-1 > 1) then Call Genrules(Lk, am-1);
    End;
  End;
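The rule generation step can be illustrated with a short sketch. It enumerates the non-empty proper subsets directly instead of recursing as Genrules does, and it keeps every rule whose confidence reaches the threshold. The name `rules_from_itemset`, the support counts and the threshold 0.7 are assumptions: the counts were chosen only so that the confidences match Table 2.5, and the thesis does not state the Minconf used there.

```python
from itertools import combinations


def rules_from_itemset(l, support, min_conf):
    """Return (antecedent, consequent, confidence) for each rule x -> (l - x)."""
    l = frozenset(l)
    rules = []
    for size in range(1, len(l)):                 # non-empty proper subsets only
        for x in combinations(sorted(l), size):
            x = frozenset(x)
            conf = support[l] / support[x]        # Support(l) / Support(x)
            if conf >= min_conf:
                rules.append((x, l - x, conf))
    return rules


# Hypothetical support counts (assumption), consistent with Table 2.5.
support = {frozenset(k): v for k, v in {
    "A": 4, "B": 6, "E": 5, "AB": 4, "AE": 4, "BE": 5, "ABE": 4}.items()}
rules = rules_from_itemset("ABE", support, min_conf=0.7)
```

With these assumed counts, the rule B → AE has confidence 4/6 and is rejected, while the other five rules of Table 2.5 are kept.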
4. An efficient solution
The sections above presented the basic process of mining association rules in a database. The issue that still needs study is how to improve the algorithm's efficiency when the number of candidate itemsets found is very large.
Within the scope of this work, we present a solution that addresses this issue.
Pruning the candidates: pruning removes unnecessary candidate itemsets and so reduces the size of the candidate set. The technique for pruning unnecessary candidates is presented below.
The technique relies on one property: the items in each candidate itemset are kept in sorted order.
The technique:
Forall itemsets c ∈ Ck do
  Forall (k-1)-subsets s of c do
    If (s ∉ Lk-1) then
      Delete c from Ck
With this we can prune the candidate itemsets and thereby limit the search space over all itemsets.
2.5. The association rule mining procedure
B1: Find all frequent itemsets (with respect to the Minsup threshold).
B2: Generate the rules from the frequent itemsets.
For each frequent itemset S, generate all non-empty subsets of S.
For each non-empty subset A of S:
The rule A → (S - A) is a required association rule if:
conf(A → (S - A)) = supp(S) / supp(A) >= Minconf
2.6. Some other algorithms
2.6.1. A parallel algorithm for mining fuzzy association rules
As in the sequential fuzzy association rule mining problem described above, each attribute iu in I is associated with a set of fuzzy sets F_iu as follows:
F_iu = {f_iu^1, f_iu^2, ..., f_iu^k}
Let FN = {k1} ∪ {k2} ∪ ... ∪ {kn} = {s1, s2, ..., sv} (v ≤ n, because some pairs ki and kj may be equal) and let N be the number of processors in the system. The problem of dividing the fuzzy attribute set among the processors is then:
Find a non-empty subset Fn of FN such that the product of the elements of Fn equals the number of processors in the system. If no exact solution is found, the algorithm returns an acceptable solution, that is, one where the product of the elements of Fn approximates N from below.
Suppose s = {k1, k2, ..., km} is a solution of the partitioning algorithm (that is, k1 * k2 * ... * km = N). The number of fuzzy attributes each processor is spared, compared with the sequential algorithm, is then (k1 - 1) + (k2 - 1) + ... + (km - 1) = (k1 + k2 + ... + km - m). The optimal solution is the one that maximizes (k1 + k2 + ... + km - m), so that as many attributes as possible are removed. To make the optimal solution easier to find, the set FN must first be sorted in descending order. This strategy matters because processing time falls exponentially as the number of fuzzy attributes decreases.
Another strategy for finding the optimal solution is that, throughout the search, the partitioning algorithm consults the supports of the fuzzy attributes (updated after the Counting function has run) to decide which attribute to split on. The attribute chosen for splitting should have fuzzy sets whose supports are roughly balanced. This helps balance the load among the processors in the subsequent steps. The problem can be solved by backtracking (recursive or not). The version below is written in non-recursive form.
Algorithm
Boolean Subset(FN, N, Idx)
  k = 1;
  Idx[1] = 0;
  S = 0;
  While (k > 0) {
    Idx[k]++;
    If (Idx[k]
          k++;
        }
      }
    } else {
      k--;
      S = FN[Idx[k]];
    }
  }
  Return False;
FindSubset(FN, N, Idx, Fn)
  for (n = N; n > 0; n--)
    If (Subset(FN, n, Idx)) {
      Fn = {FN[i] | i ∈ Idx}
      Return;
    }
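The partitioning problem described above (find a subset of FN whose product equals N, otherwise fall back to the largest n below N that can be reached) can be sketched with recursive backtracking. This is an illustrative sketch, not a reconstruction of the listing above: the names `subset_with_product` and `find_subset` are assumptions, and the sketch returns the first solution found rather than the one that maximizes k1 + k2 + ... + km - m.

```python
def subset_with_product(fn, target):
    """Return indices of a non-empty subset of fn whose product is target, or None."""
    # Visit the values in descending order, as the text recommends.
    order = sorted(range(len(fn)), key=lambda i: fn[i], reverse=True)

    def search(pos, product, chosen):
        if chosen and product == target:
            return list(chosen)
        for j in range(pos, len(order)):
            value = fn[order[j]]
            nxt = product * value
            # Skip values that overshoot the target or cannot divide it.
            if value < 1 or nxt > target or target % nxt != 0:
                continue
            chosen.append(order[j])
            found = search(j + 1, nxt, chosen)
            if found is not None:
                return found
            chosen.pop()                      # backtrack
        return None

    return search(0, 1, [])


def find_subset(fn, n_processors):
    """Try n = N, N-1, ..., 1 and return (n, chosen values) for the first hit."""
    for n in range(n_processors, 0, -1):
        idx = subset_with_product(fn, n)
        if idx is not None:
            return n, [fn[i] for i in idx]
    return None


exact = find_subset([2, 3, 3, 5], 6)       # an exact solution exists
fallback = find_subset([2, 3, 3, 5], 12)   # no subset multiplies to 12
```

In the second call no subset has product 12 or 11, so the search settles on 10 = 5 * 2, the closest reachable value below N.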
2.6.2. The FP-Growth algorithm
2.6.2.1. Essence.
- Mines frequent itemsets without a candidate generation function.
- Compresses the database into an FP-tree (Frequent Pattern tree) structure.
- Traverses the FP-tree recursively to produce the frequent itemsets.
2.6.2.2. Procedure.
B1: Build the FP-tree.
B2: Build the conditional pattern base for each frequent item (each node in the FP-tree).
B3: Build the conditional FP-tree from each conditional pattern base.
B4: Mine the conditional FP-trees recursively and grow the frequent patterns until a conditional FP-tree contains only a single path, then generate all combinations of the frequent patterns.
a- Building the FP-tree (B1)
TID   Items bought                  (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
300   {b, f, h, j, o, w}            {f, b}
400   {b, c, k, s, p}               {c, b, p}
500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}
Minsupp = 60%
- Find the frequent 1-itemsets (one scan of the database).
- Sort the frequent items in descending order of frequency into the F-List.
F-List = f-c-a-b-m-p
- Scan the database again and build the FP-tree.
Table 2.6. FP-tree
The same transaction database is used, with the threshold given as an absolute count: Minsupp = 3 (60% of 5 transactions).
Table 2.7. FP-tree: header table
Item   Frequency
f      4
c      4
a      3
b      3
m      3
p      3
Table 2.8. FP-tree (after inserting transaction 100)
{}
  f:1
    c:1
      a:1
        m:1
          p:1
Table 2.9. FP-tree (after inserting transaction 200)
{}
  f:2
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1
Table 2.10. FP-tree (after inserting transaction 300)
{}
  f:3
    c:2
      a:2
        m:1
          p:1
        b:1
          m:1
    b:1
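The construction of the FP-tree in step B1 can be sketched as follows. This is a minimal illustration under stated assumptions: the class `Node` and the functions `build_fp_tree` and `show` are hypothetical names, and the F-List f-c-a-b-m-p is passed in as given in the text, because items with equal frequency can be ordered in more than one way.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, a count, a parent link and child nodes."""

    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, f_list):
    """Insert each transaction, reordered by the F-List, into a prefix tree."""
    rank = {item: i for i, item in enumerate(f_list)}
    root = Node(None)
    header = {item: [] for item in f_list}    # item -> list of its tree nodes
    for t in transactions:
        # Keep only frequent items and sort them in F-List order.
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header


def show(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(show(child, depth + 1))
    return lines


# The five transactions from the table in B1 (TID 100 to 500).
transactions = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
counts = Counter(i for t in transactions for i in set(t))
frequent = {i: c for i, c in counts.items() if c >= 3}    # Minsupp = 3
root, header = build_fp_tree(transactions, "fcabmp")
tree_lines = show(root)
```

After all five transactions have been inserted, the root has two children, f:4 and c:1, and the counts of the nodes linked from each header entry add up to the item's frequency.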
b- The FP-Growth algorithm (B2)
- Start from the frequent item at the bottom of the FP-tree's header table.
- Traverse the FP-tree by following the node-links of each frequent item p.
The complete FP-tree after all five transactions:
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
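Example FP-tree
Transactions: {A, B}, {B, A, C}, {A, B, D}, {E, B, A}, {A, C}, {A, B, C}, {B, C}, {B, C, D}, {B, E}, {E, A}, {A, C, E}, {A, D, E}
Minsupp = 25%
Item frequencies: A 9, B 8, C 6, E 5, D 3
Null
  A:9
    B:5
      C:2
      D:1
      E:1
    C:2
      E:1
    E:2
      D:1
  B:3
    C:2
      D:1
    E:1
What would the FP-tree look like if Minsupp = 40%?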
- Collect all transformed prefix paths of item p to form the conditional pattern base of p.
Table 2.11. FP-tree: conditional pattern bases
Item   Conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
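The conditional pattern bases in Table 2.11 can be reproduced with a short sketch. As a simplifying assumption it reads the prefix paths from the F-List-ordered transactions instead of following node-links in the tree, which yields the same prefixes and counts. The function name `conditional_pattern_bases` is hypothetical, and items are assumed to be single characters.

```python
from collections import Counter, defaultdict


def conditional_pattern_bases(transactions, f_list):
    """Map each frequent item to {prefix path: count} over all transactions."""
    rank = {item: i for i, item in enumerate(f_list)}
    bases = defaultdict(Counter)
    for t in transactions:
        # Keep only frequent items and sort them in F-List order.
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        for pos, item in enumerate(ordered):
            if pos > 0:                       # items at the root have no prefix
                bases[item]["".join(ordered[:pos])] += 1
    return {item: dict(paths) for item, paths in bases.items()}


transactions = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
bases = conditional_pattern_bases(transactions, "fcabmp")
```

Item f never has a prefix, so it has no entry, which corresponds to its empty conditional pattern base.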
c- The FP-Growth algorithm (B3)
For each conditional pattern base:
- Count the occurrences of each item in the pattern base.
- Build the FP-tree for the frequent items of the pattern base.
Example: suppose the conditional pattern base of p is {fcam:2, cb:1}.
Table 2.12. FP-tree: p-conditional FP-tree
Header table: c 3
{}
  c:3
All frequent patterns related to p: p, cp
Example: the m-conditional pattern base is fca:2, fcab:1.
Table 2.13. FP-tree: m-conditional FP-tree
{}
  f:3
    c:3
      a:3
All frequent patterns related to m: m, fm, cm, am, fcm, fam, cam, fcam
Example
Table 2.14. Database
d- The FP-Growth algorithm (B4)
- Suppose the FP-tree T consists of a single path P.
- The complete set of frequent patterns of T is then generated by enumerating all combinations of the sub-paths of P.
Item   Conditional pattern base       Conditional FP-tree
f      { }                            { }
c      {(f:3)}                        {(f:3)} | c
a      {(fc:3)}                       {(f:3, c:3)} | a
b      {(fca:1), (f:1), (c:1)}        { }
m      {(fca:2), (fcab:1)}            {(f:3, c:3, a:3)} | m
p      {(fcam:2), (cb:1)}             {(c:3)} | p
m-conditional FP-tree: {} → f:3 → c:3 → a:3
Conditional pattern base of am: (fc:3); am-conditional FP-tree: {} → f:3 → c:3
Conditional pattern base of cm: (f:3); cm-conditional FP-tree: {} → f:3
Conditional pattern base of cam: (f:3); cam-conditional FP-tree: {} → f:3
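The recursive mining illustrated in steps B2 to B4 can be sketched compactly. This is a simplified variant and an assumption of this sketch, not the tree-based procedure itself: it recurses directly on conditional pattern bases, stored as lists of (prefix, count) pairs, and it skips both the compressed tree and the single-path shortcut. The function name `fp_growth` and its parameters are hypothetical.

```python
from collections import Counter


def fp_growth(pattern_base, min_count, suffix=()):
    """Grow frequent patterns from a list of (ordered items, count) pairs.

    Every item tuple must follow one fixed global order (the F-List).
    Returns a dict mapping each frequent itemset (frozenset) to its support.
    """
    counts = Counter()
    for items, n in pattern_base:
        for item in items:
            counts[item] += n
    result = {}
    for item, supp in counts.items():
        if supp < min_count:
            continue
        pattern = (item,) + suffix
        result[frozenset(pattern)] = supp
        # Conditional pattern base of `pattern`: the prefix before `item`.
        cond = [(items[:items.index(item)], n)
                for items, n in pattern_base if item in items]
        cond = [(prefix, n) for prefix, n in cond if prefix]
        result.update(fp_growth(cond, min_count, pattern))
    return result


# Ordered frequent items of the five transactions (TID 100 to 500).
ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
patterns = fp_growth([(tuple(t), 1) for t in ordered], min_count=3)
m_patterns = {p for p in patterns if "m" in p}
```

For the example database this yields the eight patterns related to m and the two patterns related to p listed above, each with support 3.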
2.6.2.3. The FP_Growth algorithm
Procedure FP_Growth(Tree, α)
  If the FP-tree Tree contains a single path P then
    For each combination β of the nodes on P
      Generate the pattern β ∪ α with Supp = the minimum Supp of the nodes in β;
  Else for each ai in the header of Tree
    Generate the pattern β = ai ∪ α with Supp = ai.Supp;
    Build β's conditional pattern base and β's conditional FP-tree Treeβ;
    If Treeβ ≠ ∅, call FP_Growth(Treeβ, β).
* Conclusion of Chapter II:
Chapter II has shown how the algorithms are applied in various areas of social life, where they play an important role in building decision support systems. Association rule mining is an approach that is still being refined. To apply association rules, the existing database must first be encoded; this is an important step that determines whether good association rules can be generated.
The Apriori algorithm finds frequent itemsets by generating candidates.
The FP_Growth algorithm finds frequent itemsets without generating candidates.
From the frequent itemsets found, we apply the association rule mining algorithm to generate the set of confident association rules.
Chapter III: IMPLEMENTING AND TESTING THE ALGORITHMS FOR FINDING FREQUENT ITEMSETS AND ASSOCIATION RULES
3.1. Problem statement.
With the economy developing as it is today, business is a matter that many people care about. As society develops, people's level of education keeps rising. Educational development is therefore of great concern to society, so trading in textbooks, reference books and school supplies is a sound business direction. To do business well, however, the seller must know how to manage it properly and sensibly.
These considerations call for book sales management software that helps the manager choose which titles to sell. For example, when selling textbooks, which reference books and school supplies should be sold alongside them? How are they related to one another?
Association rules tell us which kinds of books to sell together, helping the manager make fast, accurate and effective decisions.
3.2. Choosing the algorithm to implement.
Many algorithms can support the selection of titles in book sales management, but we chose to implement the Apriori algorithm.
The purpose of this algorithm is to produce association rules for choosing which titles to sell. For example, when selling Mathematics books, also sell Physics and Chemistry books.
3.3. Requirements for installing the algorithm.
- Hardware:
+ Minimum configuration: 256 MB of RAM.
+ 2 GB of free hard disk space.
+ CPU: P4 1.7 GHz.
- Software:
+ Visual Studio 2005 installed.
+ .NET 2.0.
3.4. Database.
3.4.1. Main interface of the database.
Figure 3.1. Main interface of the database
Description of some functions in the interface:
+ System: exits the program.
+ Customer catalog: adds, saves, edits and deletes customer data.
+ Goods catalog: adds, saves, edits and deletes goods data.
+ Invoice catalog: adds, saves, edits and deletes invoice data.
+ Supplier catalog: adds, saves, edits and deletes supplier data.
+ Apriori: writes the data to an XML file.
3.4.2. The Supplier catalog table.
The structure and data of the table are as follows:
Figure 3.2. Supplier catalog
Some attributes of the table:
+ MaNCC: supplier code.
+ TenNCC: supplier name.
+ DiaChi: supplier address.
+ DienThoai: supplier phone number.
+ MaSoThue: supplier tax code.
+ Email: supplier email.
3.4.3. The Goods catalog table.
The structure and data of the goods table are as follows:
Figure 3.3. Goods catalog
Some attributes of the table:
+ MaH: goods code.
+ MaNCC: supplier code.
+ TenHang: goods name.
+ MoTa: goods description.
+ ChungLoai: goods category.
3.4.4. The Customer catalog table.
The structure and data of the customer table are as follows:
Figure 3.4. Customer catalog
Some attributes of the table:
+ MaKH: customer code.
+ TenKH: customer name.
+ SoCMND: national identity card number.
+ DiaChi: customer address.
+ DienThoai: customer phone number.
+ Email: customer email.
3.4.5. The Invoice catalog table.
The structure and data of the invoice table are as follows:
Figure 3.5. Invoice catalog
Some attributes of the table:
+ MaHD: invoice code.
+ MaKH: customer code.
+ NgayHD: invoice entry date.
+ Ghichu: invoice note.
3.4.6. The Invoice detail table.
The structure and data of the invoice detail table are as follows:
Figure 3.6. Invoice detail catalog
Some attributes of the table:
+ MaHD: invoice code.
+ MaH: goods code.
+ SoLuong: quantity of goods.
3.4.7. Writing XML.
Figure 3.7. Writing XML
3.5. Main interface of the program.
Figure 3.8. Main interface of the program
3.6. Connecting to the data.
Figure 3.9. Connecting to the data
3.7. Adding XML data
Figure 3.10. Adding XML data
3.8. Analysis results
Figure 3.11. Analysis results
3.9. Results filtered with MinSup = 10
Figure 3.12. Results filtered by minimum support
3.10. Results filtered with MinConf = 40%
Figure 3.13. Results filtered by confidence
* Conclusion of Chapter III:
The implementation uses the Apriori algorithm, applied to sales management in a supermarket. From its results the manager learns which groups of goods are related to one another, which supports managing and selecting the goods to trade.
FUTURE DEVELOPMENT OF THE TOPIC
One of the important tasks in association rule mining is finding all frequent itemsets in the database, so in the coming period we will extend the topic in the following direction: applying a parallel algorithm to the problem of mining fuzzy association rules, that is, association rules over sets of fuzzy attributes.
The parallel algorithm divides the database and the candidate set evenly among the processors, and the candidate sets assigned to each processor are completely independent of one another. The aim is to reduce the cost of finding fuzzy association rules and the time needed to fuzzify the data.
A weakness of the Apriori algorithm is that analysis takes a very long time when the data is large. To overcome it we need additional algorithms, such as FP_Growth and parallel algorithms.
We will continue to complete the supermarket sales management system, which can also be applied in other areas such as supermarket sales and computer sales.
As the volume of collected and stored data keeps growing, together with the need to grasp information, the tasks set for data mining become ever more important. Its applicability to many areas of the economy, society, and national defense and security is also an advantage of data mining. We hope that the knowledge gained from this work will gradually be put into practice and serve people's lives.
REFERENCES
[1]. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307-328, 1996.
[2]. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. The International Conference on Very Large Databases, pages 487-499, 1994.
[3]. R. Agrawal and R. Srikant. Mining sequential patterns. In P. S. Yu and A. L. P. Chen, editors, Proc. 11th Int. Conf. Data Engineering, ICDE, pages 3-14. IEEE Press, 6-10 1995.
[4]. N. F. Ayan, A. U. Tansel, and M. E. Arkun. An efficient algorithm to update large itemsets with early pruning. In Knowledge Discovery and Data Mining, pages 287-291, 1999.
[5]. TS. Đỗ Phúc, Khai thác dữ liệu, Nhà xuất bản Đại Học Quốc Gia TP HCM, 2005.
[6]. Phạm Hữu Khang, Kỹ thuật lập trình C#.Net, Nhà xuất bản Lao động - Xã hội.
[7]. Từng bước học lập trình Visual C#.Net, Nhà xuất bản Lao động - Xã hội.
[8]. Giáo trình trí tuệ nhân tạo - cấu trúc dữ liệu - giải thuật di truyền, Nhà xuất bản Lao động - Xã hội.
[9]. http://www.cs.uh.edu/~ceick/6340/grue-assoc.pdf, last accessed 20/03/2009.
[10]. http://www.vnulib.edu.vn:8000/dspace/bitstream/123456789/1811/1/sedev0206-03.pdf, last accessed 22/03/2009.
[11]. http://gralib.hcmuns.edu.vn/gsdl/collect/hnkhbk/index/assoc/HASH0107.dir/doc.pdf, last accessed 20/03/2009.
[12]. http://www.tapchibcvt.gov.vn/News/PrintView.aspx?ID=15671, last accessed 22/03/2009.
[13]. http://www.uit.edu.vn/forum/index.php?act=Attach&type=post&id=50641, last accessed 20/03/2009.
VIETNAMESE - ENGLISH TERMINOLOGY TABLE
English                                      Vietnamese
Data Mining                                  Khai phá dữ liệu
Data                                         Dữ liệu
Knowledge Discovery in Database (KDD)        Phát hiện tri thức trong cơ sở dữ liệu
Target                                       Mục đích, mục tiêu
Cleansed - Preprocessed - Prepared           Làm sạch - Tiền xử lý - Chuẩn bị trước
Transform                                    Chuyển đổi
Pattern Discovery                            Khám phá mô hình
Knowledge                                    Tri thức
Clustering                                   Phân cụm
Summarization                                Tóm tắt
Visualization                                Trực quan hoá
Evolution and deviation analysis             Phân tích sự phát triển và độ lệch
Association rules                            Phân tích luật kết hợp
Classification                               Phân lớp
Regression                                   Hồi quy
Clustering                                   Gom nhóm
Summarization                                Tổng hợp
Dependency modeling                          Mô hình ràng buộc
Change and Deviation Detection               Dò tìm biến đổi và độ lệch
Regression                                   Hồi qui
Cross validation                             Đánh giá chéo
Support                                      Độ phổ biến
Minimum Support                              Độ phổ biến tối thiểu
Confidence                                   Độ tin cậy
Minimum Confidence                           Độ tin cậy tối thiểu
Itemset                                      Hạng mục
Procedur