The C4.5 Decision Tree Algorithm


Student: Lu Cng T
Advisor: V Tin Thnh


Outline
  • Decision tree algorithms
      - Definition
      - Building a decision tree
      - Characteristics of decision trees
  • The C4.5 algorithm
      - History
      - Pseudocode, formulas, and an example
      - Mechanisms against overfitting and against missing attribute values
      - Conversion to rules
      - Applications
      - C4.5 and See5/C5.0
      - Future directions

Decision tree algorithms: definition
A decision tree is a decision diagram laid out as a tree:
  • Root: the topmost node of the tree.
  • Internal node: represents a test on a single attribute.
  • Leaf node: represents a class.
  • Branch: an outcome of the test at the node above.

[Figure: a tree with a root, an internal node, branches, and leaf nodes]

Example of a decision tree (figure)

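Decision tree algorithms: building the tree
Building a decision tree takes two phases.
  • Growing the tree: start at the root and work down the branches, growing the tree inductively in a divide-and-conquer fashion.
      1. Choose the best attribute according to a predefined measure.
      2. Grow the tree by adding one branch for each value of the chosen attribute.
      3. Sort the training data and distribute it among the child nodes.
      4. If the examples at a node are cleanly classified, stop. Otherwise repeat steps 1 to 4 for each child node.
  • Pruning the tree: simplifies and generalizes the tree, which improves accuracy.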

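Decision tree algorithms: Hunt's algorithm
Example: Hunt's algorithm, used in C4.5, CDP, ...
  • S = {S1, S2, ..., Sn} is the training set.
  • C = {C1, C2, ..., Cm} is the set of classes.
  • Case 1: every Si (i = 1..n) belongs to the same class Cj => the decision tree is a single leaf labelled Cj.
  • Case 2: S contains cases from several classes in C. Choose a test on a single attribute with outcomes O = {O1, ..., Ok} (k is usually 2). The test becomes the root of the tree, each Oi forms a branch, and S is split into subsets whose attribute value equals Oi. Recursing on each subset yields a decision tree with several branches, one per Oi.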
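The two cases of Hunt's algorithm above can be sketched in Python. This is a minimal illustration, not C4.5 itself: the nested-dict tree layout, the function name hunt, and the choice of simply testing the first remaining attribute are assumptions made for the example, and the majority-class fallback when no attribute is left is not part of the slides.

```python
from collections import Counter

def hunt(cases, attributes):
    """cases: list of (features_dict, label); attributes: attribute names still untested."""
    labels = [label for _, label in cases]
    # Case 1: every case belongs to one class -> a leaf labelled with that class.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # Assumption (not in the slides): with no test left, use the majority class.
    if not attributes:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Case 2: choose a test on a single attribute (here just the first one),
    # split the cases by its outcomes, and recurse on every subset.
    attr, rest = attributes[0], attributes[1:]
    subsets = {}
    for features, label in cases:
        subsets.setdefault(features[attr], []).append((features, label))
    return {"test": attr,
            "children": {outcome: hunt(subset, rest)
                         for outcome, subset in subsets.items()}}

# Hypothetical toy data, invented for this sketch.
cases = [
    ({"outlook": "sunny", "windy": "no"}, "no"),
    ({"outlook": "sunny", "windy": "yes"}, "no"),
    ({"outlook": "rain", "windy": "no"}, "yes"),
    ({"outlook": "rain", "windy": "yes"}, "no"),
]
tree = hunt(cases, ["outlook", "windy"])
```

A real implementation would replace the "first attribute" choice with a selection measure such as information gain or gain ratio.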


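Decision tree algorithms: characteristics
Strengths of decision trees:
  • They produce understandable rules that can be translated into English or SQL.
  • They perform well in rule-oriented domains.
  • Classification requires little computation.
  • They handle both continuous and discrete attributes.
  • They show clearly which attributes matter most: those attributes split the data closest to the root.

Weaknesses of decision trees:
  • They are error-prone when there are many classes, because they only operate on classes with binary values.
  • Training is computationally expensive, because many nodes must be traversed to reach the final leaf nodes.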


The C4.5 algorithm: history
  • C4.5 descends from CLS and ID3.
  • ID3 (Quinlan, 1979) was a simple system that originally consisted of about 600 lines of Pascal.
  • In 1993, J. Ross Quinlan developed it into C4.5, with about 9000 lines of C.
  • The current version is See5/C5.0.
  • Underlying idea: Hunt's algorithm with a depth-first growth strategy.


The C4.5 algorithm: pseudocode
In outline:
  1. Check the base cases.
  2. For each attribute A, compute the information gained by splitting on A.
  3. Let a_best be the attribute with the best value of the selection measure.
  4. Use a_best as the test attribute of a decision node.
  5. Recurse on the sublists produced by splitting on a_best, and add the resulting nodes as children of that node.

In more detail:
    FormTree(T)
    (1) ComputeClassFrequency(T);
    (2) if OneClass or FewCases
            return a leaf;
        Create a decision node N;
    (3) ForEach Attribute A
            ComputeGain(A);
    (4) N.test = AttributeWithBestGain;
    (5) if (N.test is continuous)
            find Threshold;
    (6) ForEach T' in the splitting of T
    (7)     if (T' is Empty)
                Child of N is a leaf
            else
    (8)         Child of N = FormTree(T');
    (9) ComputeErrors of N;
        return N

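The C4.5 algorithm: attribute selection measures
The measures used to choose the best attribute are information gain and gain ratio.

Let S be partitioned by class into S = {S1, S2, ..., Sx}. The relative frequency of the cases in S that belong to class Cj is

$$RF(C_j, S) = \frac{|S_j|}{|S|}$$

The information needed to classify a case in S is

$$I(S) = -\sum_{j=1}^{x} RF(C_j, S)\,\log_2 RF(C_j, S)$$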
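As a quick check of the information measure I(S), the following Python sketch computes it from per-class case counts. The function name info and the list-of-counts input are assumptions made for this example; the logarithm is taken in base 2, which reproduces the value I(9, 5) = 0.940 quoted in this deck's worked example.

```python
import math

def info(class_counts):
    """I(S) for a set whose classes contain the given numbers of cases."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count:  # a class with no cases contributes nothing
            rf = count / total  # relative frequency RF(Cj, S)
            result -= rf * math.log2(rf)
    return result

i_9_5 = info([9, 5])  # about 0.940
```

A set containing a single class gives I(S) = 0, and two equally frequent classes give I(S) = 1 bit.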

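The C4.5 algorithm: information gain and gain ratio
Information gain: a test B partitions S into S = {S1, S2, ..., St}, and the gain of B is

$$G(S, B) = I(S) - \sum_{i=1}^{t} \frac{|S_i|}{|S|}\, I(S_i)$$

Under this criterion, test B is chosen if G(S, B) is the largest among the candidate tests.

The potential information of the partition itself is

$$P(S, B) = -\sum_{i=1}^{t} \frac{|S_i|}{|S|}\,\log_2 \frac{|S_i|}{|S|}$$

Gain ratio = G(S, B) / P(S, B). Under this criterion, test B is chosen if its gain ratio is the largest.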
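To make the two measures concrete, here is a small Python sketch that computes G(S, B), P(S, B), and the gain ratio from class counts. The function names and the input layout (one list of per-class counts for each subset produced by the test) are assumptions for illustration. The subset counts in the usage lines are hypothetical, chosen only so that they total 9 "yes" and 5 "no" cases.

```python
import math

def info(counts):
    """I(S) from a list of per-class case counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain(parent_counts, subsets):
    """G(S, B): I(S) minus the size-weighted information of the subsets."""
    total = sum(parent_counts)
    remainder = sum(sum(s) / total * info(s) for s in subsets)
    return info(parent_counts) - remainder

def potential_info(subsets):
    """P(S, B): information of the partition itself, from subset sizes only."""
    return info([sum(s) for s in subsets])

def gain_ratio(parent_counts, subsets):
    p = potential_info(subsets)
    return gain(parent_counts, subsets) / p if p else 0.0

# Hypothetical three-way split of 9 yes / 5 no cases: [yes, no] per subset.
parent = [9, 5]
split = [[2, 3], [4, 0], [3, 2]]
g = gain(parent, split)
p = potential_info(split)
ratio = gain_ratio(parent, split)
```

Dividing by P(S, B) penalizes tests that shatter the data into many small subsets, which plain information gain tends to favour.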


The C4.5 algorithm: example
Example: s1 (yes) has 9 cases and s2 (no) has 5 cases.
  • I(S) = I(s1, s2) = I(9, 5) = 0.940
  • A = age => S = {S1, S2, S3}
  • S1 (age30). I(S1): s11 (yes & [...]

[...] applies to small databases (a resubstitution error rate of 4% on a database of 20000 cases).
  • Has mechanisms for handling missing, erroneous, or overfitted data.
  • The rules produced are simple.

The C4.5 algorithm: application to data classification
  • Step 1 (learning): build a model that describes the data set and the known concepts.
      - Input: a structured data set whose cases are described by attributes.
      - Output: If-Then rules.
  • Step 2 (classification): use the model to classify new data by walking from the root to a leaf node, which gives the class of the object under consideration.
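Both steps can be illustrated on a tree that has already been built. In the Python sketch below, the nested-dict tree layout, the attribute names, and the function names classify and to_rules are all hypothetical. Each root-to-leaf path becomes one If-Then rule; the sketch does not perform any further rule simplification.

```python
def classify(tree, case):
    """Walk from the root to a leaf and return the class stored there."""
    while "leaf" not in tree:
        tree = tree["children"][case[tree["test"]]]
    return tree["leaf"]

def to_rules(tree, conditions=()):
    """Turn every root-to-leaf path into one If-Then rule."""
    if "leaf" in tree:
        premise = " and ".join(conditions) or "true"
        return [f"If {premise} then class = {tree['leaf']}"]
    rules = []
    for outcome, child in tree["children"].items():
        rules += to_rules(child, conditions + (f"{tree['test']} = {outcome}",))
    return rules

# Hypothetical tree, invented for this sketch.
tree = {"test": "outlook", "children": {
    "sunny": {"leaf": "no"},
    "rain": {"test": "windy", "children": {
        "no": {"leaf": "yes"},
        "yes": {"leaf": "no"}}}}}

rules = to_rules(tree)
label = classify(tree, {"outlook": "rain", "windy": "no"})
```

Here the second rule reads "If outlook = rain and windy = no then class = yes", and classifying the sample case follows exactly that path.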

The C4.5 algorithm: continuous attributes
Continuous attributes are handled with a binary test of the form value(V) < h, where h is a constant threshold. The threshold is found as follows:
  1. Sort the cases in S (with quick sort) by their value of the continuous attribute V under consideration, giving V = {v1, v2, ..., vm}.
  2. For i = 1..m-1, form the candidate threshold hi = (vi + v(i+1))/2. Comparing V with hi splits the values into V1 = {v1, v2, ..., vi} and V2 = {v(i+1), v(i+2), ..., vm}.
  3. Compute the information gain or gain ratio for each hi.
  4. The hi with the largest information gain or gain ratio becomes the split threshold for that attribute.
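The threshold search described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, information gain (rather than gain ratio) is used as the score, Python's built-in sorted() stands in for quick sort, and equal neighbouring values are skipped because they yield no usable midpoint.

```python
import math

def info(labels):
    """I(S) computed from a list of class labels."""
    total = len(labels)
    return -sum(labels.count(c) / total * math.log2(labels.count(c) / total)
                for c in set(labels))

def best_threshold(values, labels):
    """Return (h, gain) for the midpoint threshold with the largest information gain."""
    pairs = sorted(zip(values, labels))  # order the cases by attribute value
    ordered = [label for _, label in pairs]
    base = info(ordered)
    best_h, best_gain = None, -1.0
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # identical neighbours: no midpoint separates them
        h = (pairs[i][0] + pairs[i + 1][0]) / 2  # hi = (vi + v(i+1)) / 2
        left, right = ordered[:i + 1], ordered[i + 1:]
        g = base - (len(left) * info(left) + len(right) * info(right)) / len(pairs)
        if g > best_gain:
            best_h, best_gain = h, g
    return best_h, best_gain

# Hypothetical data: small values are class "a", large values class "b".
h, g = best_threshold([10, 1, 12, 2, 3, 11], ["b", "a", "b", "a", "a", "b"])
```

With this data the midpoint between 3 and 10 separates the two classes perfectly, so it is the threshold returned.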


C4.5 and C5.0
  • Rules: C5.0 is more accurate, faster, and uses less memory.
  • Decision trees: C5.0 is faster and produces smaller trees.

[Comparison charts; legend: blue = C5.0]


C4.5 and C5.0 (continued)
  • Boosting: C5.0 builds and combines several classifiers to increase prediction accuracy.
  • New data types, for example dates.

[Comparison chart; legend: blue = C5.0]


Research directions
  • Stable trees: the error rate of a tree on the cases it was built from is much lower than its error rate on unseen cases. For example, with 20k training cases the error rate is 4%, while with the same 20k cases and one case held out from training, the error rate is 11.7%.
    Goal: build trees whose error rates are approximately equal in both situations.
  • Decomposing complex trees: can a complex tree be split into small, simple trees without changing the results?

References
  • Nguyen Thi Thuy Linh (2005). Decision tree classification algorithms. Undergraduate thesis, University of Technology, 2005.
  • [WKQ08] Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg (2008). Top 10 algorithms in data mining. Knowl Inf Syst (2008) 14:1-37.
  • http://rulequest.com/see5-comparison.html
  • http://en.wikipedia.org/wiki/ID3_algorithm
  • http://en.wikipedia.org/wiki/C4.5_algorithm
  • http://en.wikipedia.org/wiki/Decision_tree

Thank you to my teacher, my seniors, and my classmates for your attention!
