[UpdateBook.vn] Khai Pha Du Lieu Bang Cay Quyet Dinh (Data Mining with Decision Trees)


Data mining with decision trees

TABLE OF CONTENTS

PREFACE
Chapter 1: Overview of data mining
  1.1. What are knowledge discovery and data mining?
  1.2. The knowledge discovery process
    1.2.1. Formulating and defining the problem
    1.2.2. Collecting and preprocessing data
    1.2.3. Mining the data and extracting knowledge
    1.2.4. Analyzing and validating the results
    1.2.5. Using the discovered knowledge
  1.3. The data mining process
    1.3.1. Data gathering
    1.3.2. Data selection
    1.3.3. Data cleansing and preprocessing
    1.3.4. Data transformation
    1.3.5. Pattern extraction and discovery
    1.3.6. Evaluation of results
  1.4. Functions of data mining
  1.5. Data mining techniques
    1.5.1. Classification
    1.5.2. Clustering
    1.5.3. Association rule mining
    1.5.4. Regression
    1.5.5. Genetic algorithms
    1.5.6. Neural networks
    1.5.7. Decision trees
  1.6. Types of data that can be mined
  1.7. Fields related to data mining and applications of data mining
    1.7.1. Fields related to knowledge discovery and data mining
    1.7.2. Applications of data mining
  1.8. Challenges and development directions of knowledge discovery and data mining
Chapter 2: Data mining with decision trees
  2.1. Decision trees
    2.1.1. Definition of a decision tree
    2.1.2. Advantages of decision trees
    2.1.3. Building a decision tree
    2.1.4. Extracting rules from a decision tree
  2.2. Decision tree algorithms for data mining
    2.2.1. The CLS algorithm
    2.2.2. The ID3 algorithm
    2.2.3. The C4.5 algorithm
    2.2.4. The SLIQ algorithm [5]
    2.2.5. Pruning decision trees
    2.2.6. Evaluation of and conclusions about decision tree algorithms
Chapter 3: Building the demo program
  3.1. Problem description
  3.2. Collecting and preprocessing data
  3.3. The program
Chapter 4: Conclusion
  4.1. Evaluation
    4.1.1. Theory
    4.1.2. Application
  4.2. Future directions
REFERENCES
  Vietnamese-language references
  English-language references


PREFACE

Over the years, as information technology has developed and been applied in many areas of social life, the volume of data that organizations collect and store has grown steadily. People store this data because they believe it holds some hidden value. According to statistics, however, only a small portion of it (roughly 5% to 10%) is ever analyzed. For the rest, organizations do not know what they should or could do with it, yet they keep collecting and storing it in the hope that it will one day quickly provide valuable information for timely decisions. Because traditional methods of database management and exploitation increasingly fail to meet these practical needs, a new technical trend has emerged: knowledge discovery and data mining (KDD - Knowledge Discovery and Data Mining).


Knowledge discovery and data mining techniques have been, and continue to be, studied and applied in many fields around the world. In Vietnam the technique is still relatively new, but it is also being researched and is beginning to appear in a number of practical applications. As a result, knowledge discovery and data mining currently attract the interest of many people and of many companies that develop information technology applications in our country.

Within the scope of this research project, I present the following:

Chapter 1: An overview of knowledge discovery and data mining.
Chapter 2: A study of data mining with decision trees.
Chapter 3: A demo application for data mining with decision trees.

Chapter 1: Overview of data mining

1.1. What are knowledge discovery and data mining?

Knowledge discovery in databases is the process of identifying patterns or models in data that are valid, novel, potentially useful, and understandable [4]. Data mining is a relatively new term that appeared in the late 1980s. There are many different definitions of data mining. Professor Tom Mitchell defines it as follows: "Data mining is the use of historical data to discover rules and improve future decisions." Taking a more application-oriented view, Dr. Fayyad states: "Data mining is often regarded as knowledge discovery in databases: a process of extracting hidden, previously unknown, and potentially useful information in the form of regularities, constraints, and rules in the database." Statisticians, for their part, regard "data mining as an analytical process designed to explore very large amounts of data in search of suitable patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of the data".

In summary: data mining is one step in the knowledge discovery process. It consists of specialized data mining algorithms that, under acceptable limits on computational efficiency, find patterns or models in the data [4].

1.2. The knowledge discovery process

The knowledge discovery process is carried out in the following five steps [5]:


Figure 1.1. The knowledge discovery process

1.2.1. Formulating and defining the problem

This step studies the application domain and formulates the problem. It determines which useful knowledge is to be extracted and which data mining methods suit the purpose of the application and the nature of the data.

1.2.2. Collecting and preprocessing data

In this step the data is collected in raw form (the sources may be data warehouses or information on the internet). The data is also preprocessed to transform it and improve its quality so that it fits the data mining method chosen in the previous step. This step usually takes the most time in the whole knowledge discovery process. Preprocessing algorithms include:

1. Handling missing data: missing values are replaced by suitable values.
2. Removing duplicates: duplicated data objects are discarded. This technique is not used for tasks that depend on the distribution of the data.
3. Noise reduction: noise and objects that lie apart from the overall distribution (outliers) are removed from the data.
4. Normalization: the value range of the data is normalized.
5. Discretization: numeric data is converted into discrete values.
6. Extracting and constructing new features from the existing attributes.
7. Dimensionality reduction: attributes that carry little information are removed.

1.2.3. Mining the data and extracting knowledge

This is the most important step in the knowledge discovery process. Its result is the patterns and/or models hidden in the data. A model can be a global structural representation of a component of the system or of the whole system in the database, or a description of how the data arises. A pattern is a local structure that involves only a few variables and a few cases in the database.
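Several of the preprocessing steps listed in 1.2.2 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the source does not prescribe any particular method, so mean imputation, min-max scaling and equal-width binning are choices made here, and the function names and sample values are hypothetical.

```python
from statistics import mean

def fill_missing(values):
    """Step 1: replace None by the mean of the observed values (assumed strategy)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def drop_duplicates(rows):
    """Step 2: remove repeated rows while keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def min_max_normalize(values):
    """Step 4: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, n_bins):
    """Step 5: map each number to an equal-width bin index 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

# Hypothetical toy data, not taken from the report.
ages = fill_missing([20, None, 40, 60])
rows = drop_duplicates([(1, "a"), (2, "b"), (1, "a")])
scaled = min_max_normalize(ages)
bins = discretize(ages, 2)
```

As noted in step 2, removing duplicates changes the distribution of the data, so a real pipeline would apply `drop_duplicates` only when the mining task does not depend on frequencies.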


1.2.4. Analyzing and validating the results

The fourth step is to understand the knowledge that has been found, in particular to interpret the descriptions and predictions. In this step the results are converted into a form that suits the application domain and is easier for users to understand.

1.2.5. Using the discovered knowledge

In this step the discovered knowledge is consolidated and combined into a system, and potential conflicts within it are resolved. The extracted models are deployed in real information systems as decision-support modules.

The stages of the knowledge discovery process are closely related to one another within the overall system. The techniques used in one stage can affect the effectiveness of the algorithms used in later stages. The steps can be repeated a number of times, and the results can be averaged over all runs.

1.3. The data mining process

Data mining is the central activity of the knowledge discovery process. Some scientists also refer to data mining as knowledge discovery in databases (KDD) (according to Fayyad, Smyth and Piatetsky-Shapiro, 1989). The process consists of six steps [1]:

Figure 1.2. The data mining process

The data mining process starts with a store of raw data and ends with extracted knowledge. Its steps are as follows.

1.3.1. Data gathering

Gathering data is the first step in data mining. It retrieves data from a database, a data warehouse, or even from web sources.


1.3.2. Data selection

At this stage the data is selected and partitioned according to certain criteria.

1.3.3. Data cleansing and preprocessing

This third stage is often neglected, but it is in fact a very important step in the data mining process. Common errors made while gathering data are incomplete, inconsistent, or loosely specified data. As a result the data often contains meaningless values and values that cannot be linked to other data, for example a student with age = 200. The third stage handles such data (meaningless data and data that cannot be linked), which is usually regarded as redundant and worthless. If the data is not cleaned, preprocessed, and prepared beforehand, it will cause seriously distorted results later on.

1.3.4. Data transformation

At this stage the data can be reorganized and reused. The purpose of transformation is to make the data better suited to the goal of the mining task.

1.3.5. Pattern extraction and discovery

This is the "thinking" step of data mining. At this stage various algorithms are used to extract patterns from the data. Commonly used ones are classification algorithms, association algorithms, and sequential data modeling algorithms.

1.3.6. Evaluation of results

This is the final stage of the data mining process. At this stage the patterns have been extracted by the data mining software. Not every pattern is useful, and some may even be misleading. Criteria are therefore needed to rank the patterns so that the required knowledge can be drawn from them.

1.4. Functions of data mining

Data mining has two basic functions: prediction and description.

1.5. Data mining techniques

In practice there are many different data mining techniques, which serve these two functions of description and prediction.

- Descriptive data mining techniques describe the properties or general characteristics of the data in an existing database. Techniques in this group include clustering, summarization, visualization, and evolution and deviation analysis.
- Predictive data mining techniques make predictions by inference from the current data. Techniques in this group include classification, regression, decision trees, statistics, neural networks, and association rules.


Some of the techniques most commonly used for data mining today are described below.

1.5.1. Classification

The goal of classification is to predict the class label of data samples. The process has two steps: building a model, and using the model to classify data (one class per sample). The model is used to predict class labels only once its accuracy is acceptable.

1.5.2. Clustering

The goal of clustering is to group similar objects in the data set into clusters, so that the objects within the same cluster are similar to one another.

1.5.3. Association rule mining

The goal of this method is to discover relationships between data values in the database. The output of an association rule algorithm is the set of association rules found. Association rule mining has two steps:

- Step 1: Find all frequent itemsets. An itemset is frequent if its computed support meets the minimum support.
- Step 2: Generate strong association rules from the frequent itemsets. The rules must satisfy the minimum support and the minimum confidence.

1.5.4. Regression

Regression is similar to classification. The difference is that regression predicts continuous values, whereas classification predicts discrete values.

1.5.5. Genetic algorithms

A genetic algorithm simulates natural evolution. Its main idea is based on the laws of heredity, variation, natural selection, and evolution in biology.

1.5.6. Neural networks

This is one of the most widely applied data mining techniques today. It rests on a solid mathematical foundation, and its ability to be trained is modeled on the human central nervous system. What a neural network learns can produce predictive models with high accuracy and reliability. It can detect complex trends that other, conventional techniques can hardly find. However, the method is very complex and difficult to carry out: it requires a lot of time, a lot of data, and many rounds of testing.

1.5.7. Decision trees

The decision tree technique is a powerful and effective tool for classification and prediction. Data objects are divided into classes, and the unknown values of data objects are predicted. The knowledge extracted with this technique is usually expressed in an explicit, simple, and intuitive form that users find easy to understand.
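The two steps of association rule mining described in 1.5.3 can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not an efficient algorithm such as Apriori: the function names, the thresholds, and the shopping-basket transactions are hypothetical and do not come from the report.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Step 1: enumerate every itemset whose support meets min_support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(frozenset(combo), transactions)
            if s >= min_support:
                frequent[frozenset(combo)] = s
    return frequent

def strong_rules(frequent, min_confidence):
    """Step 2: keep rules A -> B with confidence = supp(A and B) / supp(A)."""
    rules = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is always available in the dictionary.
                conf = supp / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((set(lhs), set(itemset - lhs), supp, conf))
    return rules

# Hypothetical toy transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
frequent = frequent_itemsets(transactions, min_support=0.5)
rules = strong_rules(frequent, min_confidence=0.6)
```

With these toy transactions, the rule butter -> bread has support 0.5 and confidence 1.0, because every basket that contains butter also contains bread.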


1.6. Types of data that can be mined

- Relational databases
- Multidimensional databases
- Transactional databases
- Object-relational databases
- Spatial and temporal databases
- Multimedia databases

1.7. Fields related to data mining and applications of data mining

1.7.1. Fields related to knowledge discovery and data mining

Knowledge discovery and data mining are applied in, and draw on, many different sectors and fields, such as finance and banking, commerce, healthcare, education, statistics, machine learning, artificial intelligence, databases, mathematical algorithms, high-speed parallel computing, and knowledge acquisition for expert systems.

1.7.2. Applications of data mining

Data mining is used to solve problems in many different areas. It addresses complex problems in highly technical industries, such as oil exploration from remote sensing images and failure warning in production systems. It is applied to the planning and development of real management and production systems, for example forecasting electricity load, forecasting product consumption, and segmenting customers. It is also applied to social problems such as crime detection and security enhancement.

Some specific applications are:

- Data mining is used for data analysis and decision support.
- In biology: searching and comparing genomes and genetic information, finding relationships between genomes, and diagnosing certain hereditary diseases.
- In medicine: finding relationships between symptoms and diagnosing diseases.
- In finance and the stock market: analyzing financial conditions, investments, and stocks.
- Web data mining.
- In engineering: analyzing faults, control, and scheduling.
- In commerce: analyzing user data and marketing data, analyzing investments, and detecting fraud.

1.8. Challenges and development directions of knowledge discovery and data mining

The development of knowledge discovery and data mining faces the following challenges:

- Large databases (number of records, number of tables).
- High dimensionality.
- Changes in data and knowledge can make previously discovered patterns no longer valid.
- Missing or noisy data.
- Complex relationships between fields.
- Interaction with users and integration with existing knowledge.
- Integration with other systems.


The direction of development for knowledge discovery and data mining is to overcome all of the challenges above. The focus is on extending applications to every area of social life and on increasing the usefulness of data mining in the areas where it is already used; on creating flexible data mining methods that handle large volumes of data efficiently; on providing good user interaction, so that users can take part in controlling the mining process and steer the system toward interesting patterns; on integrating data mining into database systems; and on applying data mining to online web data. Another important issue in the development of knowledge discovery and data mining is information safety and security.

Chapter 2: Data mining with decision trees

2.1. Decision trees

2.1.1. Definition of a decision tree

In machine learning, a decision tree is a kind of predictive model, that is, a mapping from observations about an object or phenomenon to conclusions about its target value. Each internal node corresponds to a variable, and each edge from that node to one of its children represents a specific value of that variable. Each leaf node represents the predicted value of the target variable, given the values of the variables along the path from the root to that leaf. The machine learning technique that uses decision trees is called decision tree learning, or simply decision trees. [3]

Example: a decision tree for classifying salary level.

Age?

    Age ≤ 35: salary?
        salary ≤ 40: bad
        salary > 40: good
    Age > 35: salary?
        salary ≤ 50: bad
        salary > 50: good

Figure 2.1. Decision tree for classifying salary level

2.1.2. Advantages of decision trees

Compared with other data mining methods, decision trees have the following advantages:

- Decision trees are relatively easy to understand.
- They require only simple data preprocessing.
- They can handle both discrete and continuous data.
- A decision tree is a white-box model.
- Predictions made by a decision tree can be validated with statistical tests.
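To illustrate the white-box property, the tree of Figure 2.1 can be written directly as nested conditions. The sketch below assumes that the branch labels in the figure read "≤ 35 / > 35" for Age and "≤ 40 / > 40" and "≤ 50 / > 50" for salary; the function name `classify_salary` is hypothetical.

```python
def classify_salary(age, salary):
    """Follow the path from the root (Age) to a leaf of Figure 2.1."""
    if age <= 35:
        # Left subtree: the salary threshold is 40 (assumed from the figure).
        return "bad" if salary <= 40 else "good"
    # Right subtree: the salary threshold is 50 (assumed from the figure).
    return "bad" if salary <= 50 else "good"
```

Each `return` corresponds to one leaf, so a prediction can be traced back to the exact sequence of tests that produced it.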

2.1.3. Building a decision tree

There are many algorithms for building decision trees, such as CLS, ID3, C4.5, SLIQ, SPRINT, EC4.5, and C5.0. In general, however, building a decision tree is divided into three basic stages:

a. Building the tree: recursively split the training data set until the samples at each leaf node belong to the same class.
b. Pruning the tree: this optimizes the tree. Pruning merges a subtree into a single leaf node.
c. Evaluating the tree: this measures the accuracy of the resulting tree. The criterion is the number of correctly classified samples divided by the total number of input samples.

2.1.4. Extracting rules from a decision tree

A decision tree model and a rule-based model (IF ... THEN ...) can be converted into each other. The two models are equivalent. For example, from the tree in Figure 2.1 we can derive the following rules:

IF (Age ≤ 35) AND (salary ≤ 40) THEN class = bad
IF (Age ≤ 35) AND (salary > 40) THEN class = good
IF (Age > 35) AND (salary ≤ 50) THEN class = bad
IF (Age > 35) AND (salary > 50) THEN class = good

2.2. Decision tree algorithms for data mining

2.2.1. The CLS algorithm

This algorithm was introduced by Hovland and Hunt in the Concept Learning System (CLS) in the 1950s, and has since been known simply as the CLS algorithm. CLS follows a top-down divide-and-conquer strategy. It consists of the following steps [6]:

1. Create a node T that contains all the samples of the training set.
2. If all samples in T have the decision attribute value "yes" (that is, they belong to the same class), label node T "yes" and stop. T is now a leaf node.
3. If all samples in T have the decision attribute value "no" (that is, they belong to the same class), label node T "no" and stop. T is now a leaf node.
4. Otherwise, the samples of the training set belong to both classes "yes" and "no", and:
   + Choose an attribute X from the attribute set of the samples, where X has the values v1, v2, ..., vn.
   + Split the samples in T into subsets T1, T2, ..., Tn according to the value of X.
   + Create n child nodes Ti (i = 1, 2, ..., n) with T as their parent.
   + Create branches from T to the nodes Ti (i = 1, 2, ..., n), each labeled with the corresponding value of X.
5. Repeat the procedure for each child node Ti (i = 1, 2, ..., n), going back to step 2.
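The steps above can be sketched as a short recursive function. This is a minimal illustration under stated assumptions, not the original CLS implementation: samples are dictionaries, the "arbitrary" attribute of step 4 is simply the first one in the list, a majority-class fallback (not part of the steps above) handles the case where no attribute is left, and the training data is a hypothetical toy set.

```python
def cls(samples, attributes, target="class"):
    """Build a tree with CLS; samples are dicts, attributes a list of keys."""
    labels = {s[target] for s in samples}
    # Steps 2-3: all samples share one class, so T becomes a leaf.
    if len(labels) == 1:
        return labels.pop()
    # Assumption, not in the steps above: when no attribute is left,
    # fall back to the majority class.
    if not attributes:
        values = [s[target] for s in samples]
        return max(sorted(set(values)), key=values.count)
    # Step 4: CLS picks an attribute arbitrarily; here, the first in the list.
    attr, rest = attributes[0], attributes[1:]
    node = {"attribute": attr, "children": {}}
    for value in sorted({s[attr] for s in samples}):
        subset = [s for s in samples if s[attr] == value]
        # Step 5: repeat the procedure on each child node T_i.
        node["children"][value] = cls(subset, rest, target)
    return node

def predict(tree, sample):
    """Walk from the root to a leaf following the sample's attribute values."""
    while isinstance(tree, dict):
        tree = tree["children"][sample[tree["attribute"]]]
    return tree

# Hypothetical toy training set.
training = [
    {"outlook": "sunny", "windy": "no", "class": "yes"},
    {"outlook": "sunny", "windy": "yes", "class": "no"},
    {"outlook": "rainy", "windy": "no", "class": "no"},
    {"outlook": "rainy", "windy": "yes", "class": "no"},
]
tree = cls(training, ["outlook", "windy"])
other = cls(training, ["windy", "outlook"])
```

Passing the attributes in a different order, as in `other`, yields a tree with a different root and shape even though both trees classify the training samples correctly.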

Note that in step 4 of the algorithm the attribute chosen to expand the tree is arbitrary. Consequently, applying CLS to the same training set with different attribute orders produces trees of different shapes. The choice of attribute affects the width, depth, and complexity of the tree. The question therefore arises: which order of attributes is best for expanding the tree? The ID3 algorithm below addresses this question.

2.2.2. The ID3 algorithm

The ID3 algorithm was formulated by Quinlan (University of Sydney, Australia) and first published in the late 1970s. It was later presented in the article "Induction of Decision Trees", Machine Learning, 1986. ID3 is regarded as an improvement of CLS in that it selects the best attribute with which to continue expanding the tree at each step. ID3 builds the decision tree top-down [5].

Entropy [5] measures the homogeneity of a data set. The entropy of a set S is computed with formula (2.1):

$Entropy(S) = -P_+ \log_2(P_+) - P_- \log_2(P_-)$    (2.1)

This applies when the samples have two classes, "yes" (+) and "no" (-). $P_+$ denotes the proportion of samples in S whose decision attribute is "yes", and $P_-$ the proportion whose decision attribute is "no".

In the general case, for a set S with n classes, the formula is:

$Entropy(S) = \sum_{i=1}^{n} \left(-P_i \log_2(P_i)\right)$    (2.2)
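As a minimal sketch, formula (2.2) can be computed directly from a list of class labels, taking each proportion as the fraction of samples carrying that label. The function name `entropy` and the example label lists are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) as the sum over classes of -p_i * log2(p_i), as in (2.2)."""
    total = len(labels)
    counts = Counter(labels)
    # Classes with zero samples never appear in counts, so log2(0) is avoided.
    return sum(-(c / total) * log2(c / total) for c in counts.values())

pure = entropy(["yes", "yes", "yes", "yes"])      # one class only
balanced = entropy(["yes", "yes", "no", "no"])    # two classes, evenly split
mixed = entropy(["yes", "yes", "yes", "no"])      # two classes, uneven split
```

For the two-class examples, `pure` evaluates to 0, `balanced` to 1, and `mixed` lies strictly between them.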

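Here $P_i$ is the proportion of samples belonging to class i in the set S of samples under examination. Special cases:

- If all samples in S belong to the same class, then Entropy(S) = 0.
- If the samples in S are evenly distributed between two classes, then Entropy(S) = 1 (for n evenly distributed classes the value is $\log_2 n$).
- In the remaining cases, 0 < Entropy(S)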