1
Data Mining
dr Iwona Schab
Decision Trees
2
Decision Trees
Method of classification: a recursive procedure which (progressively) divides a set of n units into groups according to a division rule
Designed for supervised prediction problems (i.e. a set of input variables is used to predict the value of a target variable)
The primary goal is prediction: the fitted tree model is used to predict the target variable for new cases (i.e. to score new cases/data)
Result: a final partition of the observations and the Boolean rules needed to score new data
3
Decision Tree
A predictive model represented in a tree-like structure
Root node
A split based on the values of the input
Terminal node – the leaf
Internal node
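A minimal sketch of this structure in Python (the names are illustrative, not part of the slides): internal nodes carry a split on an input, terminal nodes carry the prediction.

    # Hypothetical representation of a tree: internal nodes hold a split,
    # terminal nodes (leaves) hold the predicted target.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Node:
        split_var: Optional[str] = None      # None marks a terminal node
        split_value: Optional[float] = None  # cut-off used by the split
        left: Optional["Node"] = None        # child for input <= split_value
        right: Optional["Node"] = None       # child for input >  split_value
        prediction: Optional[str] = None     # target stored in a leaf

        def is_leaf(self):
            return self.split_var is None

    # a root node with one split and two terminal nodes
    root = Node(split_var="age", split_value=35,
                left=Node(prediction="good"), right=Node(prediction="bad"))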
4
Decision tree
Nonparametric method
Allows for modelling of nonlinear relationships
Sound concept, easy to interpret
Robust against outliers
Detects and takes into account potential interactions between input variables
Additional applications: categorisation of continuous variables, grouping of nominal values
5
Decision Trees
Types:
Classification trees (categorical response variable): the leaves give the predicted class and the probability of class membership
Regression trees (continuous response variable): the leaves give the predicted value of the target
Exemplary applications: handwriting recognition, medical research, financial and capital markets
6
Decision Tree
The path to each leaf expresses a Boolean rule: if … then …
The ’regions’ of the input space are determined by the split values: intersections of subspaces defined by a single splitting variable
A regression tree model is a multivariate step function
Leaves represent the predicted target: all cases in a particular leaf are given the same predicted target
Splits: binary, or multiway (inputs partitioned into disjoint ranges)
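As an illustration, a hypothetical fitted tree written out as such Boolean rules; the inputs "age" and "income" and the cut-offs are invented, not taken from the slides.

    # A fitted tree written out as nested if ... then ... rules; every case
    # that reaches the same leaf receives the same predicted target.
    def score(case):
        if case["age"] <= 35:              # if age <= 35 ...
            if case["income"] <= 3000:     # ... and income <= 3000
                return "bad"               # then predict "bad"
            return "good"
        return "good"

    print(score({"age": 28, "income": 2500}))  # -> "bad"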
7
Analytical decisions
Recursive partitioning rule / splitting criterion
Pruning criterion / stopping criterion
Assignment of the predicted target variable
8
Recursive partitioning rule
Method used to fit the tree: a top-down, greedy algorithm
Starts at the root node; splits involving each single input are examined:
disjoint subsets of nominal inputs, disjoint ranges of ordinal / interval inputs
The splitting criterion measures the reduction in variability of the target distribution in the child nodes and is used to choose the split
The chosen split determines the partitioning of the observations
The partition is repeated in each child node as if it were the root node of a new tree
The partition continues deeper in the tree; the process is repeated recursively until stopped by the stopping rule
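A compact sketch of the whole procedure for a single interval input and a binary target, using the Gini index as the splitting criterion; the function names, the minimum node size and the demo data are invented.

    # Sketch of top-down greedy partitioning for one interval input and a
    # binary target (0/1), with the Gini index as splitting criterion.
    def gini(y):
        p = sum(y) / len(y)
        return p * (1 - p)

    def best_split(x, y):
        # Examine every candidate cut point; keep the one with the largest
        # drop in impurity (the splitting criterion).
        best = None
        for cut in sorted(set(x))[:-1]:
            left  = [t for v, t in zip(x, y) if v <= cut]
            right = [t for v, t in zip(x, y) if v > cut]
            drop = gini(y) - (len(left) / len(y) * gini(left)
                              + len(right) / len(y) * gini(right))
            if best is None or drop > best[1]:
                best = (cut, drop)
        return best

    def grow(x, y, min_size=10):
        # Recursive partitioning: each child is treated as the root of a
        # new tree until the stopping rule fires.
        split = best_split(x, y)
        if len(y) < min_size or split is None or split[1] <= 0:
            return {"leaf": sum(y) / len(y)}       # predicted target
        cut = split[0]
        l = [(v, t) for v, t in zip(x, y) if v <= cut]
        r = [(v, t) for v, t in zip(x, y) if v > cut]
        return {"cut": cut,
                "left":  grow([v for v, _ in l], [t for _, t in l], min_size),
                "right": grow([v for v, _ in r], [t for _, t in r], min_size)}

    print(grow([18, 22, 30, 41, 45, 52, 60, 64],
               [1, 1, 1, 0, 0, 0, 0, 1], min_size=4))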
9
Splits on (at least) ordinal input
Restrictions in order to preserve the ordering: only adjacent values are grouped
Problem: to partition an input with L distinct values (levels) into B groups
(L−1 choose B−1) partitions are possible, i.e. the splits on a single ordinal input correspond to choices of B−1 cut points among the L−1 gaps between adjacent levels
Any monotonic transformation of the level of the input (with at least an ordinal measurement scale) gives the same split
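A quick check of this count:

    # Number of ways to split L ordered levels into B adjacent groups:
    # choose B-1 cut points out of the L-1 gaps between levels.
    from math import comb

    def ordinal_partitions(L, B):
        return comb(L - 1, B - 1)

    print(ordinal_partitions(5, 2))  # 4 binary splits of a 5-level input
    print(ordinal_partitions(5, 3))  # 6 three-way splits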
10
Splits on nominal input
No restrictions regarding ordering
Problem: to partition into B groups input with L distinct values (levels)
Number of partitions: S(L, B), the Stirling number of the second kind, counts the number of ways to partition a set of L labelled objects into B nonempty unlabelled subsets
The total number of partitions: the Bell number B_L = S(L,1) + S(L,2) + … + S(L,L)
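Both quantities are easy to evaluate with the standard recurrence S(L,B) = B·S(L−1,B) + S(L−1,B−1):

    # Stirling numbers of the second kind via the standard recurrence;
    # the Bell number (total number of partitions) is the sum over B.
    from functools import lru_cache

    @lru_cache(maxsize=None)
    def stirling2(L, B):
        if L == B:
            return 1
        if B == 0 or B > L:
            return 0
        return B * stirling2(L - 1, B) + stirling2(L - 1, B - 1)

    def bell(L):
        return sum(stirling2(L, B) for B in range(1, L + 1))

    print(stirling2(4, 2))  # 7 ways to split 4 levels into 2 groups
    print(bell(4))          # 15 partitions in total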
11
Binary splits
Ordinal input: L − 1 possible binary splits (one per cut point between adjacent levels)
Nominal input: 2^(L−1) − 1 possible binary splits (each subset of levels versus its complement)
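A short enumeration confirming both counts for L = 4:

    # Binary splits: an ordinal input with L levels has L-1 splits (one per
    # cut point); a nominal input has 2**(L-1) - 1 (each split is a subset
    # versus its complement, counted once).
    from itertools import combinations

    levels = ["a", "b", "c", "d"]           # L = 4
    ordinal_splits = len(levels) - 1        # 3

    nominal_splits = []
    for size in range(1, len(levels)):
        for subset in combinations(levels[1:], size - 1):
            # fixing levels[0] inside the subset avoids counting each
            # split twice (a subset and its complement are the same split)
            nominal_splits.append({levels[0], *subset})

    print(ordinal_splits, len(nominal_splits))  # 3 7  (= 2**3 - 1)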
12
Partitioning rule – possible variations
Incorporating some type of look-ahead or backup: often produces inferior trees (look-ahead has not been shown to be an improvement; Murthy and Salzberg, 1995)
Oblique splits
Splits on linear combinations of inputs (as opposed to the standard coordinate-axis splits, i.e. boundaries parallel to the input coordinates)
13
Recursive partitioning algorithm
Start with the L-way split
Collapse the two levels that are closest (based on the splitting criterion)
Repeat the process on the set of L−1 consolidated levels, obtaining one split of each size; choose the best split for the given input
Repeat the process for each input and choose the best input
CHAID algorithm: additional backward elimination step
Number of splits to consider is greatly reduced (for ordinal input: …; for nominal input: …)
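A sketch of this consolidation step, assuming a chi-square test as the measure of how alike two levels are, and treating the input as nominal for brevity (for an ordinal input only adjacent levels would be merged); the level counts are those of the age example used later in these slides.

    # Greedy consolidation of input levels, CHAID-style: repeatedly merge
    # the pair of levels whose target distributions are most alike
    # (largest chi-square p-value), until only two groups remain.
    from scipy.stats import chi2_contingency

    # counts of [good, bad] per level
    levels = {"young": [800, 200], "medium": [500, 100], "older": [300, 100]}

    while len(levels) > 2:
        best_pair, best_p = None, -1.0
        names = list(levels)
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                table = [levels[names[i]], levels[names[j]]]
                p = chi2_contingency(table)[1]   # p-value of the test
                if p > best_p:
                    best_pair, best_p = (names[i], names[j]), p
        a, b = best_pair
        levels[f"{a}+{b}"] = [levels[a][0] + levels[b][0],
                              levels[a][1] + levels[b][1]]
        del levels[a], levels[b]

    print(levels)  # the two consolidated groups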
14
Stopping criterion
Governs the depth and complexity of the tree
The right balance between depth and complexity
When the tree is too complex:
perfect discrimination in the training sample
lost stability
lost ability to generalise discovered patterns and relations
overfitting to the training sample
difficulties with the interpretation of predictive rules
Trade-off between the adjustment to the training sample and the ability to generalise
15
Splitting criterion
Impurity reduction
Chi-square test
An exhaustive tree algorithm considers all possible partitions, of all inputs, at every node → combinatorial explosion
16
Splitting criterion
Minimise impurity within child nodes / maximise differences between the newly split child nodes
Choose the split into child nodes which:
maximises the drop in impurity resulting from the partition of the parent node
maximises the difference between nodes
Measures of impurity: basic ratio, Gini impurity index, entropy
Measures of difference: based on relative frequencies (classification tree), based on target variance (regression tree)
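A minimal sketch of the three impurity measures for a binary target and of the resulting drop in impurity; p denotes p(G|v), and the demo split at the end uses the numbers of the age example later in these slides.

    # Impurity measures for a node with a proportion p of "good" cases,
    # and the drop in impurity used as the splitting criterion.
    from math import log

    def basic(p):    # basic ratio: share of the minority class
        return min(p, 1 - p)

    def gini(p):     # Gini impurity index
        return p * (1 - p)

    def entropy(p):  # entropy (0 by convention when p is 0 or 1)
        return 0.0 if p in (0.0, 1.0) else -p * log(p) - (1 - p) * log(1 - p)

    def impurity_drop(measure, p_parent, children):
        # children: list of (weight, p) pairs for the child nodes
        return measure(p_parent) - sum(w * measure(p) for w, p in children)

    # split of a parent with p = 0.8 into two child nodes
    print(impurity_drop(gini, 0.8, [(0.7, 1100 / 1400), (0.3, 500 / 600)]))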
17
Binary decision trees
Nonparametric model: no assumptions regarding distributions needed
Classifies observations into pre-defined groups: the target variable is predicted for the whole leaf
Supervised segmentation: in the basic case, a recursive partition into two separate categories in order to maximise the similarity of observations within a leaf and maximise the differences between leaves
Tree model = rules of segmentation
No previous selection of input variables
18
Trees vs hierarchical segmentation
Hierarchical segmentation: descriptive approach; unsupervised classification; segmentation based on all variables; each partitioning based on all variables at a time, using a distance measure
Trees: predictive approach; supervised classification; segmentation based on the target variable; each partitioning based on one variable at a time (usually)
19
Requirements
Large data sample
In case of classification trees: a sufficient number of cases falling into each class of the target (suggested: min. 500 cases per class)
20
Stopping criterion
The node reaches a pre-defined size (e.g. 10 or fewer cases)
The algorithm has run for the predefined number of generations
The split results in a (too) small drop of impurity
Expected losses in the testing sample
Stability of results in the testing sample
Probabilistic assumptions regarding the variables (e.g. the CHAID algorithm)
Expected loss of the fitted tree: EL(T) = Σ_{t∈T̃_G} D·p(B|t) + Σ_{t∈T̃_B} L·p(G|t), where T̃_G and T̃_B denote the leaves classified as good and bad, D the loss incurred on an accepted bad case and L the income lost on a rejected good case
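A minimal numerical sketch of this criterion; the costs D and L, the leaf statistics and the weighting of leaves by their share of cases are all assumptions for illustration, not values from the slides.

    # Hypothetical expected-loss computation for a fitted tree. Leaves
    # classified "good" lose D on each bad case they accept; leaves
    # classified "bad" lose L on each good case they reject.
    D, L = 1000.0, 200.0  # assumed unit costs (invented)

    # (share of cases in leaf t, p(B|t), class assigned to t) -- invented
    leaves = [(0.5, 0.05, "good"), (0.3, 0.20, "good"), (0.2, 0.60, "bad")]

    el = sum(w * (D * p_bad if cls == "good" else L * (1 - p_bad))
             for w, p_bad, cls in leaves)
    print(el)  # expected loss per scored case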
21
Target assignment to the leaf
Frequency based: a threshold is needed
Cost of misclassification based:
α – cost of the type I error, e.g. the average cost incurred due to acceptance of a „bad” credit
β – cost of the type II error, e.g. the average income lost due to rejection of a „good” credit
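The cost-based rule can be written as a threshold on p(B|t): assign „bad” when α·p(B|t) > β·p(G|t), i.e. when p(B|t) > β/(α+β). A sketch with invented costs:

    # Cost-based assignment of the target to a leaf: reject ("bad") when
    # the expected cost of accepting, alpha * p(B|t), exceeds the expected
    # cost of rejecting, beta * p(G|t). The costs here are invented.
    def assign(p_bad, alpha, beta):
        # equivalent threshold form: "bad" iff p_bad > beta / (alpha + beta)
        return "bad" if alpha * p_bad > beta * (1 - p_bad) else "good"

    print(assign(0.30, alpha=1000, beta=200))  # -> "bad"  (threshold ~0.167)
    print(assign(0.10, alpha=1000, beta=200))  # -> "good"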
22
Disadvantages
Lack of stability (often)
Stability assessed on the basis of a testing sample, without formal statistical inference
In case of a classification tree: the target value is calculated in a separate step with a „simplistic” method (dominating-frequency assignment)
The target value is calculated at the leaf level, not at the individual observation level
Drop of impurity ΔI
Basic Impurity Index
ΔI = I(v) − [p(l)·I(l) + p(r)·I(r)]
where p(l)·I(l) + p(r)·I(r) is the average impurity of the child nodes l and r
I(v) = p(G|v) if p(G|v) ≤ 0,5
I(v) = p(B|v) if p(B|v) ≤ 0,5
i.e. I(v) = min{ p(G|v), p(B|v) }
Splitting Example
Gini Impurity Index: I(v) = p(G|v) · p(B|v)
Entropy: I(v) = − p(G|v)·ln p(G|v) − p(B|v)·ln p(B|v)
Pearson’s test for relative frequencies: χ²(v) = n_l·n_r / (n_l + n_r) · ( p(G|l) − p(G|r) )²
Splitting Example
Age #G #B Odds (of being good)
Young 800 200 4 : 1
Medium 500 100 5 : 1
Older 300 100 3 : 1
Total 1600 400 4 : 1
How to split the ordinal (in this case) variable „age”? (young+older) vs. medium? (young+medium) vs. older?
Splitting Example
1. Young + Older = r versus Medium = l
I(v) = min{400/2000; 1600/2000} = 0,2
ΔI = I(v) − [p(l)·I(l) + p(r)·I(r)]
p(r) = 1400/2000 = 0,7; p(l) = 600/2000 = 0,3; I(r) = 300/1400; I(l) = 100/600
ΔI_BI = 0,2 − 0,7·(300/1400) − 0,3·(100/600) = 0,2 − 0,2 = 0
Splitting Example
2. Young + Medium = r versus Older = l
I(v) = min{400/2000; 1600/2000} = 0,2
ΔI = I(v) − [p(l)·I(l) + p(r)·I(r)]
p(r) = 1600/2000 = 0,8; p(l) = 400/2000 = 0,2; I(r) = 300/1600; I(l) = 100/400
ΔI_BI = 0,2 − 0,8·(300/1600) − 0,2·(100/400) = 0,2 − 0,2 = 0
The basic index does not differentiate between the two candidate splits
Splitting Example
1. Young + Older = r versus Medium = l
p(r) = 1400/2000 = 0,7; p(l) = 600/2000 = 0,3
ΔI = I(v) − [p(l)·I(l) + p(r)·I(r)]
I(v) = (1600/2000)·(400/2000) = 0,16
I(r) = (1100/1400)·(300/1400) = 33/196 ≈ 0,1684
I(l) = (500/600)·(100/600) = 5/36 ≈ 0,1389
ΔI_GI = 0,16 − 0,7·(33/196) − 0,3·(5/36) = 0,16 − 0,1595 = 0,0005
Splitting Example
2. Young + Medium = r versus Older = l
p(r) = 1600/2000 = 0,8; p(l) = 400/2000 = 0,2
ΔI = I(v) − [p(l)·I(l) + p(r)·I(r)]
I(v) = (1600/2000)·(400/2000) = 0,16
I(r) = (1300/1600)·(300/1600) = 39/256 ≈ 0,1523
I(l) = (300/400)·(100/400) = 3/16 ≈ 0,1875
ΔI_GI = 0,16 − 0,8·(39/256) − 0,2·(3/16) = 0,16 − 0,1594 = 0,0006
The Gini index prefers the second split (0,0006 > 0,0005): young+medium versus older
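All four impurity drops above can be reproduced in a few lines (a sketch; the counts are taken from the table):

    # Reproduce the example: drop in impurity for the two candidate
    # splits, under the basic index and the Gini index.
    def drop(measure, parent, l, r):
        n = parent[0] + parent[1]
        pl, pr = sum(l) / n, sum(r) / n
        share = lambda grp: grp[1] / sum(grp)   # p(B | node)
        return (measure(share(parent))
                - pl * measure(share(l)) - pr * measure(share(r)))

    basic = lambda p: min(p, 1 - p)
    gini  = lambda p: p * (1 - p)

    total  = (1600, 400)                         # (#G, #B)
    split1 = ((1100, 300), (500, 100))           # young+older vs medium
    split2 = ((1300, 300), (300, 100))           # young+medium vs older

    for name, (r, l) in [("1", split1), ("2", split2)]:
        print(name, round(drop(basic, total, l, r), 4),
                    round(drop(gini,  total, l, r), 4))
    # split 1: basic 0.0, Gini 0.0005; split 2: basic 0.0, Gini 0.0006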