
Data Mining


Page 1: Data  Mining

1

Data Mining

dr Iwona Schab

Decision Trees

Page 2: Data  Mining

2

Decision Trees

Method of classification: a recursive procedure which (progressively) divides a set of n units into groups according to a division rule

Designed for supervised prediction problems (i.e. a set of input variables is used to predict the value of a target variable)

The primary goal is prediction: the fitted tree model is used to predict the target variable for new cases (i.e. to score new cases/data)

Result: a final partition of the observations and the Boolean rules needed to score new data

Page 3: Data  Mining

3

Decision Tree

A predictive model represented in a tree-like structure

Root node

A split based on the values of the input

Terminal node – the leaf

Internal node
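As a rough illustration (not part of the original slides), the tree structure above can be represented with a small node type; the names TreeNode, split_value and prediction are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    """Hypothetical node of a binary decision tree."""
    feature: Optional[int] = None        # index of the splitting input (internal nodes)
    split_value: Optional[float] = None  # split threshold on that input
    left: Optional["TreeNode"] = None    # child for feature value <= split_value
    right: Optional["TreeNode"] = None   # child for feature value > split_value
    prediction: Optional[float] = None   # predicted target (leaves only)

    def is_leaf(self) -> bool:
        # A node with no children is a terminal node (a leaf)
        return self.left is None and self.right is None
```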

Page 4: Data  Mining

4

Decision tree

Nonparametric method
Allows modelling of nonlinear relationships
Sound concept, easy to interpret
Robust against outliers
Detects and takes into account potential interactions between input variables
Additional applications: categorisation of continuous variables, grouping of nominal values

Page 5: Data  Mining

5

Decision Trees

Types:
Classification trees (categorical response variable) – the leaves give the predicted class and the probability of class membership
Regression trees (continuous response variable) – the leaves give the predicted value of the target

Exemplary applications: handwriting recognition, medical research, financial and capital markets
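A minimal, hedged sketch of the two tree types, assuming the scikit-learn implementations (DecisionTreeClassifier, DecisionTreeRegressor); the toy data are invented for illustration, not taken from the slides.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[25, 1000], [40, 3000], [55, 1500], [30, 5000]]  # toy inputs (age, income)
y_class = ["bad", "good", "good", "good"]             # categorical target
y_value = [0.2, 0.8, 0.6, 0.9]                        # continuous target

# Classification tree: the leaf gives the predicted class and class probabilities
clf = DecisionTreeClassifier(max_depth=2).fit(X, y_class)
print(clf.predict([[35, 2000]]), clf.predict_proba([[35, 2000]]))

# Regression tree: the leaf gives the predicted value (the leaf mean) of the target
reg = DecisionTreeRegressor(max_depth=2).fit(X, y_value)
print(reg.predict([[35, 2000]]))
```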

Page 6: Data  Mining

6

Decision Tree

The path to each leaf can be expressed as a Boolean rule: if … then …
The 'regions' of the input space are determined by the split values – intersections of subspaces, each defined by a single splitting variable
A regression tree model is a multivariate step function
Leaves represent the predicted target: all cases in a particular leaf receive the same predicted target

Splits:
Binary
Multiway (inputs partitioned into disjoint ranges)
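To make the "if … then …" reading concrete, here is a hypothetical set of Boolean rules read off a small fitted classification tree; the variables age, income and the cut-offs are invented for illustration.

```python
def score_case(age: float, income: float) -> str:
    """Score a new case with hypothetical Boolean rules from a fitted tree.
    Each if/then path corresponds to one leaf; every case reaching the same
    leaf receives the same predicted target."""
    if age <= 30:
        if income <= 2000:
            return "bad"   # leaf 1: age <= 30 and income <= 2000
        return "good"      # leaf 2: age <= 30 and income >  2000
    return "good"          # leaf 3: age > 30

print(score_case(age=25, income=1500))  # -> "bad"
```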

Page 7: Data  Mining

7

Analytical decisions

Recursive partitioning rule / splitting criterion

Pruning criterion / stopping criterion

Assignment of the predicted target value

Page 8: Data  Mining

8

Recursive partitioning rule

Method used to fit the tree: a top-down, greedy algorithm
Starts at the root node
Splits involving each single input are examined:
disjoint subsets of nominal inputs
disjoint ranges of ordinal / interval inputs

The splitting criterion measures the reduction in variability of the target distribution in the child nodes and is used to choose the split

The chosen split determines the partitioning of the observations

The partitioning is repeated in each child node as if it were the root node of a new tree

The partitioning continues deeper into the tree – the process is repeated recursively until it is stopped by the stopping rule
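A compact sketch of the top-down, greedy recursive partitioning described above, assuming numpy arrays and a binary 0/1 target and using the Gini index as the splitting criterion; this illustrates the idea and is not the exact algorithm from the lecture.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary 0/1 target vector."""
    p = y.mean()
    return p * (1 - p)

def best_split(X, y):
    """Examine the splits on every single input and return the one with the
    largest drop in impurity of the target distribution."""
    best = None
    for j in range(X.shape[1]):
        for threshold in np.unique(X[:, j])[:-1]:
            left = X[:, j] <= threshold
            drop = gini(y) - (left.mean() * gini(y[left])
                              + (1 - left.mean()) * gini(y[~left]))
            if best is None or drop > best[0]:
                best = (drop, j, threshold)
    return best  # (impurity drop, input index, split value) or None

def grow(X, y, depth=0, max_depth=3, min_size=10):
    """Recursive partitioning: split, then treat each child as a new root."""
    split = best_split(X, y)
    if depth >= max_depth or len(y) <= min_size or split is None or split[0] <= 0:
        return {"leaf": float(y.mean())}  # stopping rule reached
    _, j, threshold = split
    left = X[:, j] <= threshold
    return {"input": j, "value": float(threshold),
            "left":  grow(X[left],  y[left],  depth + 1, max_depth, min_size),
            "right": grow(X[~left], y[~left], depth + 1, max_depth, min_size)}
```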

Page 9: Data  Mining

9

Splits on (at least) ordinal input

Restrictions in order to preserve the ordering: only adjacent values are grouped

Problem: to partition an input with L distinct values (levels) into B groups

There are $\binom{L-1}{B-1}$ possible splits on a single ordinal input (choose the B−1 cut points among the L−1 gaps between adjacent levels)

Any monotonic transformation of the levels of the input (with at least an ordinal measurement scale) gives the same split
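The count of admissible ordinal splits follows from choosing the cut points between adjacent levels; a short sketch (the function name is mine):

```python
from math import comb

def ordinal_partitions(L: int, B: int) -> int:
    """Number of ways to partition L ordered levels into B groups of adjacent
    levels: choose the B-1 cut points among the L-1 gaps between levels."""
    return comb(L - 1, B - 1)

print(ordinal_partitions(L=10, B=2))  # 9 binary splits on a 10-level ordinal input
print(ordinal_partitions(L=10, B=3))  # 36 three-way splits
```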

Page 10: Data  Mining

10

Splits on nominal input

No restrictions regarding ordering

Problem: to partition an input with L distinct values (levels) into B groups

Number of partitions into B groups: the Stirling number of the second kind, $S(L, B)$, which counts the number of ways to partition a set of L labelled objects into B nonempty unlabelled subsets

The total number of partitions: $\sum_{B=1}^{L} S(L, B)$ (the Bell number)
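A small sketch computing these counts; stirling2 implements the standard recurrence for the Stirling numbers of the second kind, and the total over all group counts is the Bell number.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(L: int, B: int) -> int:
    """S(L, B): ways to partition L labelled objects into B nonempty,
    unlabelled subsets."""
    if B == 0:
        return 1 if L == 0 else 0
    if L == 0 or B > L:
        return 0
    # The L-th object either starts its own subset or joins one of the B subsets.
    return stirling2(L - 1, B - 1) + B * stirling2(L - 1, B)

def total_partitions(L: int) -> int:
    """Total number of partitions of L nominal levels (the Bell number)."""
    return sum(stirling2(L, B) for B in range(1, L + 1))

print(stirling2(4, 2))      # 7 ways to split 4 nominal levels into 2 groups
print(total_partitions(4))  # 15 partitions in total
```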

Page 11: Data  Mining

11

Binary splits

Ordinal input with L levels: $L - 1$ possible binary splits

Nominal input with L levels: $2^{L-1} - 1$ possible binary splits

Page 12: Data  Mining

12

Partitioning rule – possible variations

Incorporating some type of look-ahead or backup: often produces inferior trees and has not been shown to be an improvement (Murthy and Salzberg, 1995)

Oblique splits: splits on linear combinations of inputs (as opposed to the standard coordinate-axis splits, i.e. boundaries parallel to the input coordinates)

Page 13: Data  Mining

13

Recursive partitioning algorithm

Start with the L-way split
Collapse the two levels that are closest (based on a splitting criterion)
Repeat the process on the set of L−1 consolidated levels, and so on, obtaining one candidate split of each size
Choose the best split for the given input
Repeat the process for each input and choose the best input

CHAID algorithm: an additional backward elimination step

The number of splits to consider is greatly reduced, for ordinal as well as nominal inputs
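A possible reading of the level-collapsing heuristic as Python, using closeness of the good-rate of adjacent groups as a stand-in for the splitting criterion (a simplification introduced here, not the CHAID chi-square test); it records the grouping obtained at each size.

```python
def collapse_levels(levels, good, bad):
    """Start from the L-way split and repeatedly merge the two adjacent groups
    whose target (good) rates are closest, keeping the grouping of each size."""
    groups = [([lvl], g, b) for lvl, g, b in zip(levels, good, bad)]
    groupings = {len(groups): [g[0] for g in groups]}
    while len(groups) > 1:
        rates = [g / (g + b) for _, g, b in groups]
        # adjacent pair with the most similar good-rate (ordinal: adjacency preserved)
        i = min(range(len(groups) - 1), key=lambda k: abs(rates[k] - rates[k + 1]))
        merged = (groups[i][0] + groups[i + 1][0],
                  groups[i][1] + groups[i + 1][1],
                  groups[i][2] + groups[i + 1][2])
        groups[i:i + 2] = [merged]
        groupings[len(groups)] = [g[0] for g in groups]
    return groupings

print(collapse_levels(["young", "medium", "older"],
                      good=[800, 500, 300], bad=[200, 100, 100]))
# {3: [...], 2: [['young', 'medium'], ['older']], 1: [['young', 'medium', 'older']]}
```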

Page 14: Data  Mining

14

Stopping criterion

Governs the depth and complexity of the tree

The right balance between depth and complexity

When the tree is too complex:
perfect discrimination in the training sample
lost stability
lost ability to generalise the discovered patterns and relations
overfitting to the training sample
difficulties in interpreting the predictive rules

Trade-off between the fit to the training sample and the ability to generalise
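As an illustration of how this balance is controlled in practice, here is how the rules above might be set in scikit-learn (an assumed implementation, not one named in the lecture); each hyperparameter mirrors one of the stopping/complexity rules.

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # cap the depth of the tree
    min_samples_leaf=50,         # a leaf must keep a minimum number of cases
    min_impurity_decrease=1e-3,  # require a minimum drop in impurity per split
    ccp_alpha=0.0,               # > 0 turns on cost-complexity pruning
)
```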

Page 15: Data  Mining

15

Splitting criterion

Impurity reduction
Chi-square test

An exhaustive tree algorithm considers all possible partitions, of all inputs, at every node

→ combinatorial explosion

Page 16: Data  Mining

16

Splitting criterion

Minimise impurity within the child nodes / maximise the differences between the newly split child nodes

Choose the split into child nodes which:
maximises the drop in impurity resulting from the partition of the parent node
maximises the difference between the nodes

Measures of impurity: basic ratio, Gini impurity index, entropy

Measures of difference: based on relative frequencies (classification tree), based on target variance (regression tree)
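The three impurity measures and the resulting drop can be written out directly for a two-class node; a short sketch (function names are mine):

```python
import math

def basic_ratio(p_good: float) -> float:
    """Basic ratio: the share of the minority class in the node."""
    return min(p_good, 1.0 - p_good)

def gini_index(p_good: float) -> float:
    """Gini impurity index for a two-class node."""
    return p_good * (1.0 - p_good)

def entropy(p_good: float) -> float:
    """Entropy of a two-class node (0 when the node is pure)."""
    if p_good in (0.0, 1.0):
        return 0.0
    p_bad = 1.0 - p_good
    return -(p_good * math.log(p_good) + p_bad * math.log(p_bad))

def impurity_drop(impurity, p_parent, p_left, p_right, w_left):
    """Drop in impurity when a parent node is split into two children;
    w_left is the share of the parent's cases sent to the left child."""
    children = w_left * impurity(p_left) + (1.0 - w_left) * impurity(p_right)
    return impurity(p_parent) - children
```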

Page 17: Data  Mining

17

Binary Decision trees

Nonparametric model: no assumptions regarding distributions needed

Classifies observations into pre-defined groups: the target value is predicted for the whole leaf

Supervised segmentation: in the basic case, recursive partitioning into two separate categories in order to maximise the similarity of observations within a leaf and maximise the differences between leaves

Tree model = rules of segmentation
No prior selection of input variables

Page 18: Data  Mining

18

Trees vs hierarchical segmentation

Hierarchical segmentation:
descriptive approach
unsupervised classification
segmentation based on all variables
each partitioning based on all variables at a time – based on a distance measure

Trees:
predictive approach
supervised classification
segmentation based on the target variable
each partitioning based on one variable at a time (usually)

Page 19: Data  Mining

19

Requirements

Large data sample

In the case of classification trees: a sufficient number of cases falling into each class of the target (suggested: min. 500 cases per class)

Page 20: Data  Mining

20

Stopping criterion

The node reaches a pre-defined size (e.g. 10 or fewer cases)
The algorithm has run the predefined number of generations
The split results in a (too) small drop of impurity
Expected losses in the testing sample

Stability of results in the testing sample
Probabilistic assumptions regarding the variables (e.g. the CHAID algorithm)

Expected loss of tree $T$, summed over its leaves $\tilde{T}$, with misclassification costs $D$ and $L$:

$EL(T) = \sum_{t \in \tilde{T}} \big[\, D \cdot p(B \mid t) + L \cdot p(G \mid t) \,\big]$

Page 21: Data  Mining

21

Target assignment to the leaf

Frequency based: a threshold is needed

Cost of misclassification based:
α – cost of the type I error, e.g. the average cost incurred due to acceptance of a "bad" credit
β – cost of the type II error, e.g. the average income lost due to rejection of a "good" credit
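The α/β costs translate into a simple acceptance rule per leaf: accept when the expected cost of acceptance, α·p(bad | leaf), is below the expected cost of rejection, β·p(good | leaf), i.e. when p(bad | leaf) < β/(α+β). A small sketch (the function name and numbers are illustrative):

```python
def leaf_decision(p_bad: float, alpha: float, beta: float) -> str:
    """Cost-based target assignment for a leaf.
    alpha: cost of a type I error (accepting a 'bad' credit)
    beta:  cost of a type II error (rejecting a 'good' credit)
    Accept iff alpha * p_bad < beta * (1 - p_bad),
    i.e. iff p_bad < beta / (alpha + beta)."""
    return "accept" if alpha * p_bad < beta * (1.0 - p_bad) else "reject"

# With alpha = 4 * beta the acceptance threshold is p_bad < 0.2:
print(leaf_decision(p_bad=0.15, alpha=4.0, beta=1.0))  # accept
print(leaf_decision(p_bad=0.25, alpha=4.0, beta=1.0))  # reject
```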

Page 22: Data  Mining

22

Disadvantages

Lack of stability (often)

Stability assessment on the basis of the testing sample, without formal statistical inference

In the case of a classification tree: the target value is determined in a separate step with a "simplistic" method (assignment by dominating frequency)

The target value is determined at the leaf level, not at the level of the individual observation

Page 23: Data  Mining

Splitting Example

Drop of impurity ΔI:

$\Delta I = I(v) - \big[\, p(l)\, I(l) + p(r)\, I(r) \,\big]$

where $p(l)\,I(l) + p(r)\,I(r)$ is the average impurity of the child nodes.

Basic impurity index:

$I(v) = \min\{\, p(G \mid v),\; p(B \mid v) \,\} = \begin{cases} p(G \mid v) & \text{if } p(G \mid v) \le 0.5 \\ p(B \mid v) & \text{if } p(B \mid v) \le 0.5 \end{cases}$

Page 24: Data  Mining

Splitting Example

Gini impurity index:

$I(v) = p(G \mid v)\; p(B \mid v)$

Entropy:

$I(v) = -\,p(G \mid v)\,\ln p(G \mid v) \;-\; p(B \mid v)\,\ln p(B \mid v)$

Pearson's $\chi^2$ test for relative frequencies: compares the relative frequencies $p(G \mid l)$ and $p(G \mid r)$ in the two child nodes, weighted by the child-node sizes $n_l$ and $n_r$.

Page 25: Data  Mining

Splitting Example

Age      #G     #B    Odds (of being good)
Young    800    200   4 : 1
Medium   500    100   5 : 1
Older    300    100   3 : 1
Total    1600   400   4 : 1

How to split the (in this case ordinal) variable "age"?
(young + older) vs. medium?
(young + medium) vs. older?

Page 26: Data  Mining

Splitting Example

1. Young + Older = r versus Medium = l

I(v) = min{400/2000; 1600/2000} = 0.2
p(r) = 1400/2000 = 0.7,  I(r) = 300/1400
p(l) = 600/2000 = 0.3,  I(l) = 100/600

$\Delta I_B = I(v) - p(l)\,I(l) - p(r)\,I(r) = 0.2 - 0.3 \cdot \tfrac{100}{600} - 0.7 \cdot \tfrac{300}{1400} = 0.2 - 0.05 - 0.15 = 0$

Page 27: Data  Mining

Splitting Example

2. Young + Medium = r versus Older = l

I(v) = min{400/2000; 1600/2000} = 0.2
p(r) = 1600/2000 = 0.8,  I(r) = 300/1600
p(l) = 400/2000 = 0.2,  I(l) = 100/400

$\Delta I_B = I(v) - p(l)\,I(l) - p(r)\,I(r) = 0.2 - 0.2 \cdot \tfrac{100}{400} - 0.8 \cdot \tfrac{300}{1600} = 0.2 - 0.05 - 0.15 = 0$

The basic impurity index does not discriminate between the two candidate splits: both give a zero drop in impurity.

Page 28: Data  Mining

Splitting Example

1. Young + Older = r versus Medium = l (Gini impurity index)

p(r) = 1400/2000 = 0.7,  p(l) = 600/2000 = 0.3
I(v) = (1600/2000) · (400/2000) = 0.16
I(r) = (1100/1400) · (300/1400) ≈ 0.1684
I(l) = (500/600) · (100/600) ≈ 0.1389

$\Delta I_G = I(v) - p(l)\,I(l) - p(r)\,I(r) = 0.16 - 0.3 \cdot 0.1389 - 0.7 \cdot 0.1684 = 0.16 - 0.1595 = 0.0005$

Page 29: Data  Mining

Splitting Example

2. Young + Medium = r versus Older = l (Gini impurity index)

p(r) = 1600/2000 = 0.8,  p(l) = 400/2000 = 0.2
I(v) = (1600/2000) · (400/2000) = 0.16
I(r) = (1300/1600) · (300/1600) ≈ 0.1523
I(l) = (300/400) · (100/400) = 0.1875

$\Delta I_G = I(v) - p(l)\,I(l) - p(r)\,I(r) = 0.16 - 0.2 \cdot 0.1875 - 0.8 \cdot 0.1523 = 0.16 - 0.1594 = 0.0006$

The Gini index therefore prefers split 2 (young + medium vs. older): its drop in impurity, 0.0006, is larger than the 0.0005 obtained for split 1.
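The worked example can be checked with a few lines of Python; the numbers below reproduce the zero drop of the basic index for both candidate splits and the Gini drops of roughly 0.0005 and 0.0006 (helper names are mine).

```python
def basic(p_good):  # basic impurity index
    return min(p_good, 1 - p_good)

def gini(p_good):   # Gini impurity index
    return p_good * (1 - p_good)

def drop(impurity, parent, left, right):
    """Impurity drop for a split; each node is a (n_good, n_bad) pair."""
    n = lambda node: node[0] + node[1]
    p = lambda node: node[0] / n(node)
    w_l, w_r = n(left) / n(parent), n(right) / n(parent)
    return impurity(p(parent)) - w_l * impurity(p(left)) - w_r * impurity(p(right))

root = (1600, 400)
young, medium, older = (800, 200), (500, 100), (300, 100)
merge = lambda a, b: (a[0] + b[0], a[1] + b[1])

# Basic index: both candidate splits give a zero drop
print(round(drop(basic, root, medium, merge(young, older)), 4))   # 0.0
print(round(drop(basic, root, older,  merge(young, medium)), 4))  # 0.0

# Gini index: split 2 (young+medium vs older) wins, 0.0006 > 0.0005
print(round(drop(gini, root, medium, merge(young, older)), 4))    # 0.0005
print(round(drop(gini, root, older,  merge(young, medium)), 4))   # 0.0006
```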