Intelligent Software Lab. Machine Learning and Its Applications Decision Tree Learning using the WEKA Cheon Eum Park Kangwon National University

Kangwon National University, Computer Science major · cs.kangwon.ac.kr/~parkce/seminar/DecisionTree.pdf · 2016-06-17


Intelligent Software Lab.

Machine Learning and Its Applications

Decision Tree Learning using the WEKA

Cheon Eum Park

Kangwon National University


Index

• Intro.

• Decision Tree concept

• Weka

• Data format

• Learning Decision Trees

• Experiment

• Result


Intro.

• Classification practice with Weka
  • Problems: identifying risky bank loans, identifying poisonous mushrooms

• Dataset (Weka uses the ‘.arff’ dataset format)
  • Original data: Credit.csv, Mushroom.csv
  • Converted from ‘.csv’ to ‘.arff’ (with Python)

• Algorithms: Decision Tree J48 (C4.8), J48graft (C5)

• Tuning parameters to improve the performance of the D.T. models

• Comparing the performance of the algorithms
  • J48 vs. J48graft vs. Logistic vs. SVM vs. MLP


Decision Tree concept

A machine learning model that automatically generates tree-shaped rules according to the information gain of the features.

Decision Tree concept

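• Entropy

$$\mathrm{Entropy}(S) \equiv -p_{+} \log_2 p_{+} - p_{-} \log_2 p_{-}$$

Generalization to $K$ classes:

$$I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i$$

• Information Gain
  • Place the attribute with the largest information gain at the top of the decision tree

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$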
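The entropy and information gain definitions can be checked numerically with a short Python sketch. The `toy` sample and the function names below are hypothetical and are not taken from credit.csv; the sample is built so that `housing` separates the classes perfectly and `phone` carries no information.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Gain(S, A) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v)."""
    labels = [row[target] for row in rows]
    # Group the class labels by the value of the chosen attribute.
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy sample: 8 loans, 4 of them defaulted.
toy = [
    {"housing": "own", "phone": "yes", "default": "no"},
    {"housing": "own", "phone": "yes", "default": "no"},
    {"housing": "own", "phone": "no", "default": "no"},
    {"housing": "own", "phone": "no", "default": "no"},
    {"housing": "rent", "phone": "yes", "default": "yes"},
    {"housing": "rent", "phone": "yes", "default": "yes"},
    {"housing": "rent", "phone": "no", "default": "yes"},
    {"housing": "rent", "phone": "no", "default": "yes"},
]

gain_housing = information_gain(toy, "housing", "default")
gain_phone = information_gain(toy, "phone", "default")
print(gain_housing, gain_phone)
```

On this sample `housing` has the larger gain, so a gain-based learner would test it at the root of the tree.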


Data format: credit.csv

[Figure: credit.csv, with callouts for the attributes, the instances, and the label]

Data format: credit.arff

• @relation credit

• @attribute checking_balance {1-200DM,<0DM,>200DM,unknown}
• @attribute months_loan_duration numeric
• @attribute credit_history {poor,perfect,good,critical,verygood}
• @attribute purpose {business,car,car0,furniture/appliances,education,renovations}
• @attribute amount numeric
• @attribute savings_balance {unknown,100-500DM,500-1000DM,<100DM,>1000DM}
• @attribute employment_duration {1-4years,>7years,4-7years,<1year,unemployed}
• @attribute percent_of_income numeric
• @attribute years_at_residence numeric
• @attribute age numeric
• @attribute other_credit {none,store,bank}
• @attribute housing {own,other,rent}
• @attribute existing_loans_count numeric
• @attribute job {management,unskilled,skilled,unemployed}
• @attribute dependents numeric
• @attribute phone {yes,no}
• @attribute default {yes,no}

• @data
• <0DM,6,critical,furniture/appliances,1169,unknown,>7years,4,4,67,none,own,2,skilled,1,yes,no

Nominal: categorical values
Numeric: numeric values
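The slides do not show the Python script used for the ‘.csv’ to ‘.arff’ conversion, so the following is only a minimal sketch of how such a conversion could look. The function name `csv_to_arff` and its parameters are hypothetical. It assumes the CSV has a header row, that the caller names the numeric columns, and that no value contains spaces or commas (such values would need quoting in ARFF).

```python
import csv
import io


def csv_to_arff(csv_text, relation, numeric_columns):
    """Convert CSV text (with a header row) into ARFF text.

    Columns listed in numeric_columns become 'numeric' attributes;
    every other column becomes a nominal attribute whose value set
    is collected from the data.
    """
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]

    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        if name in numeric_columns:
            lines.append("@attribute " + name + " numeric")
        else:
            # Distinct values, kept in order of first appearance.
            values = list(dict.fromkeys(row[i] for row in data))
            lines.append("@attribute " + name + " {" + ",".join(values) + "}")

    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical three-column excerpt in the style of credit.csv.
sample = "checking_balance,amount,default\n<0DM,1169,no\n1-200DM,5951,yes\n"
arff = csv_to_arff(sample, "credit", {"amount"})
print(arff)
```

Collecting the nominal values from the data keeps the sketch short, but a value that never appears in the file will be missing from the attribute declaration.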


Data format: mushroom.csv

[Figure: mushroom.csv, with callouts for the attributes, the instances, and the label]

Data format: mushroom.arff

• @relation mushrooms

• @attribute cap_shape {flat,sunken,bell,knobbed,convex,conical}

• @attribute cap_surface {fibrous,grooves,smooth,scaly}

• @attribute cap_color {pink,brown,gray,purple,yellow,green,cinnamon,white,red,buff}

• @attribute bruises {yes,no}

• @attribute odor {none,almond,anise,pungent,musty,foul,spicy,fishy,creosote}

• @attribute gill_attachment {attached,free}

• @attribute gill_spacing {close,crowded}

• @attribute gill_size {broad,narrow}

• @attribute gill_color {pink,brown,gray,purple,yellow,chocolate,black,orange,green,white,red,buff}

• @attribute stalk_shape {enlarging,tapering}

• @attribute stalk_root {club,bulbous,equal,missing,rooted}

• @attribute stalk_surface_above_ring {fibrous,scaly,smooth,silky}

• @attribute stalk_surface_below_ring {fibrous,silky,smooth,scaly}

• @attribute stalk_color_above_ring {pink,gray,brown,yellow,cinnamon,orange,white,red,buff}

• @attribute stalk_color_below_ring {pink,gray,brown,yellow,cinnamon,orange,white,red,buff}

• @attribute veil_type {partial}

• @attribute veil_color {orange,brown,white,yellow}

• @attribute ring_number {none,two,one}

• @attribute ring_type {large,flaring,evanescent,pendant,none}

• @attribute spore_print_color {brown,purple,yellow,chocolate,black,orange,green,white,buff}

• @attribute population {solitary,several,scattered,clustered,abundant,numerous}

• @attribute habitat {urban,paths,grasses,leaves,woods,waste,meadows}

• @attribute type {poisonous,edible}

• @data

• knobbed,smooth,gray,no,none,free,crowded,broad,white,enlarging,missing,smooth,smooth,white,white,partial,white,two,pendant,white,numerous,grasses,edible

Weka

Identifying risky bank loans (credit)

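[Figure: Weka screenshot of the credit data, with callouts for the attribute list, the attribute values, and the class]

The chart below the attribute values shows their distribution for the highlighted feature (the bars are in order from the left, starting with No. 1).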

File open (cont.)

The distribution can be inspected for each type of attribute.

Data Visualize


Choose classifier


Setting the parameters

• ID3 (1979)
• C4.5 (1993)
• C4.8 (1996): J48
• C5.0 (commercial): J48graft

Setting the data set used for evaluation


Classifier output


Classifier output: J48 pruned tree

Important attribute!!

Tree View


Training


Cross-validation


Test


Evaluation Metrics

• Accuracy (percent correct)

• TP, FP rate

• Precision

• Recall

• Other metrics: F-Measure, Kappa statistic, Mean absolute error, etc.

Evaluation Metrics

• Confusion Matrix

Prediction \ Real    Positive    Negative
Positive             TP          FP
Negative             FN          TN

Class labels in the Weka output: Positive = yes (a), Negative = no (b)

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
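As a small worked example, the four metrics can be computed directly from the confusion-matrix counts. The counts used here (TP = 40, FP = 10, FN = 20, TN = 30) are hypothetical and are not results from the slides; the function name is also an assumption made for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""

    def safe_div(numerator, denominator):
        # Return 0.0 instead of failing when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

With these counts, accuracy is 70/100, precision is 40/50, and recall is 40/60, which shows how a classifier can be fairly precise while still missing a third of the positive cases.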


Classifying Visualize


Identifying poisonous mushrooms

[Figure: Weka screenshot of the mushroom data, with callouts for the attribute list and the attribute values]

Classifier output


Classifier output: J48 pruned tree

Important attribute!!

Tree View


Training


Cross-validation


Test


Classifying Visualize


Parameter setting of D.T. algorithm

• J48
  • Main parameters: unpruned, numFolds, minNumObj

• J48graft
  • Main parameters: unpruned, minNumObj
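The slides set these parameters in the Weka GUI. The same J48 settings can also be passed on Weka's command line, and the sketch below only builds such a command as a Python list without running it. The helper name `j48_command` and the `weka.jar` path are hypothetical. The flags (`-t` training file, `-x` cross-validation folds, `-M` minNumObj, `-U` unpruned, `-R` with `-N` for reduced-error pruning folds) are written as I understand the J48 options, so check them against the `-h` output of your Weka version.

```python
def j48_command(train_arff, min_num_obj=2, unpruned=False, num_folds=None,
                cv_folds=10, weka_jar="weka.jar"):
    """Build (but do not run) a Weka command line for J48.

    num_folds only matters for reduced-error pruning, so setting it
    also adds -R. An unpruned tree cannot be combined with it.
    """
    if unpruned and num_folds is not None:
        raise ValueError("unpruned and numFolds (reduced-error pruning) conflict")

    cmd = [
        "java", "-cp", weka_jar, "weka.classifiers.trees.J48",
        "-t", train_arff,        # training data
        "-x", str(cv_folds),     # folds for cross-validation
        "-M", str(min_num_obj),  # minNumObj: minimum instances per leaf
    ]
    if unpruned:
        cmd.append("-U")
    if num_folds is not None:
        cmd += ["-R", "-N", str(num_folds)]
    return cmd


# One command per minNumObj setting, e.g. M2 (the default) up to M10.
commands = [j48_command("credit.arff", min_num_obj=m) for m in (2, 4, 6, 8, 10)]
print(" ".join(commands[1]))
```

Each list can be handed to `subprocess.run` once the jar path points at a real Weka installation.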


Experiment: credit

J48 (C4.8)

accuracy (%)    default (M2)    M4       M6       M8       M10
train           85.22           83.78    80.56    80.11    79.56
cross           73.11           73.33    71.89    74.00    74.22
test            75              73       71       72       72

J48graft (C5)

accuracy (%)    default (M2)    M4       M6       M8       M10
train           85.22           83.78    80.56    80.11    79.56
cross           73.22           73.44    72.22    74.11    74.11
test            75              73       71       72       72

J48 (C4.8) max: 72%
other options: 75%


Experiment: mushroom

J48 (C4.8)

accuracy (%)    default (M2)    M4        M6        M8        M10
train           100.00          100.00    100.00    100.00    100.00
cross           99.99           99.99     99.99     99.99     99.99
test            100             100       100       100       100

J48graft (C5)

accuracy (%)    default (M2)    M4        M6        M8        M10
train           100.00          100.00    100.00    100.00    100.00
cross           99.99           99.99     99.99     99.99     99.99
test            100             100       100       100       100

J48 (C4.8) max: 99.99%
other options: 100%


Experiment: compare other models

[Figure: test accuracy of each model on the credit and mushroom data]

• Model parameters: default values

Result

[Figure: results for the credit and mushroom data]

Data splitting method: quantization (quantitative method)
• D.T.: handles nominal data better than numeric data
• Others: handle numeric data well

References

• Weka Wiki: http://weka.wikispaces.com/

• Wikipedia, Weka (machine learning): http://en.wikipedia.org/wiki/Weka_(machine_learning)

• Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html

• Textbooks
  • Ethem Alpaydin (2010), Introduction to Machine Learning, Second Edition, The MIT Press
  • Brett Lantz (2013), Machine Learning with R, Packt Publishing