Applied Machine Learning: Decision Trees
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives
decision trees:
- model
- cost function
- how it is optimized
how to grow a tree and why you should prune it!
Adaptive bases
So far we assumed a fixed set of bases in f(x) = Σ_d w_d φ_d(x).
Several methods can be classified as learning these bases adaptively:
decision trees, generalized additive models, boosting, neural networks
f(x) = Σ_d w_d φ_d(x; v_d), where each basis has its own parameters v_d.
Decision trees: motivation
Pros: decision trees are interpretable; they are not very sensitive to outliers; they do not need data normalization.
Cons: they can easily overfit and they are unstable. Remedies: pruning, random forests.
image credit: https://mymodernmet.com/the-30second-rule-a-decision/
Decision trees: idea
Divide the input space into regions and learn one function per region:
f(x) = Σ_k w_k I(x ∈ R_k)
- the regions are learned adaptively
- more sophisticated prediction per region is also possible (e.g., one linear model per region)
Split the regions successively based on the value of a single variable, called a test.
Each region is a set of conditions, e.g. R_2 = {x_1 ≤ t_1, x_2 ≤ t_4}.
[figure: a partition of the (x_1, x_2) plane into regions, each with its own constant prediction w_k]
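The piecewise-constant model above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation; the regions, thresholds, and constants below are made-up examples.

```python
# Sketch of the decision-tree prediction f(x) = sum_k w_k * I(x in R_k):
# each region R_k is a list of conditions, and prediction returns the
# constant w_k of the region containing x.

def predict(x, regions):
    """Return the constant w_k of the first region whose conditions x satisfies."""
    for conditions, w in regions:
        if all(cond(x) for cond in conditions):
            return w
    raise ValueError("x falls in no region")

# Hypothetical example: R_2 = {x_1 <= t_1, x_2 <= t_4} with made-up
# thresholds t_1 = 2.0, t_4 = 5.0, and a catch-all region for the rest.
regions = [
    ([lambda x: x[0] <= 2.0, lambda x: x[1] <= 5.0], 1.5),  # R_2, w_2 = 1.5
    ([lambda x: True], 0.0),                                 # everything else
]

print(predict((1.0, 3.0), regions))  # point inside R_2 -> 1.5
```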
Prediction per region
Suppose we have identified the regions R_k; what constant w_k should we use for prediction in each region?
For regression: use the mean value of the training data points in that region:
w_k = mean(y^(n) | x^(n) ∈ R_k)
For classification: count the frequency of the classes per region and predict the most frequent label (or return a probability):
w_k = mode(y^(n) | x^(n) ∈ R_k)
Example: predicting survival on the Titanic. Each leaf shows the most frequent label, the frequency of survival, and the percentage of the training data in that region.
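The two per-region constants can be computed directly with the standard library; a small sketch (function names are ours):

```python
from statistics import mean, mode

# Given the labels of the points already assigned to a region R_k:
# the regression constant is the mean label, the classification
# constant is the most frequent label (the mode).

def region_constant_regression(ys):
    return mean(ys)               # w_k = mean(y^(n) | x^(n) in R_k)

def region_constant_classification(ys):
    return mode(ys)               # w_k = mode(y^(n) | x^(n) in R_k)

print(region_constant_regression([1.0, 2.0, 3.0]))      # 2.0
print(region_constant_classification([0, 1, 1, 1, 0]))  # 1
```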
Feature types
Given a feature, what are the possible tests?
Continuous features (e.g., age, height, GDP): all the values that appear in the dataset can be used to split, S_d = {s_{d,n} = x_d^(n)}, one set of possible splits for each feature d. Each split asks: x_d > s_{d,n}?
Ordinal features (e.g., grade, rating), with x_d ∈ {1, …, C}: we can split at any value, so S_d = {s_{d,1} = 1, …, s_{d,C} = C}. Each split asks: x_d > s_{d,c}?
Categorical features (types, classes, and categories):
- multi-way split {x_d = 1, …, x_d = C}. Problem: it can lead to sparse subsets; data fragmentation means some splits may have few or no data points.
- binary split: assume C binary features (one-hot coding), so instead of x_d ∈ {1, …, C} we have x_{d,1} ∈ {0, 1}, …, x_{d,C} ∈ {0, 1}, and each split asks x_{d,c} = 0 or x_{d,c} = 1.
- alternative: binary splits that produce balanced subsets.
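The candidate tests per feature type can be enumerated as follows; a sketch with our own function names, assuming the conventions above:

```python
# Candidate splits per feature type.

def splits_continuous(values):
    # every observed value is a candidate threshold: S_d = {s_{d,n} = x_d^(n)}
    return sorted(set(values))

def splits_ordinal(C):
    # x_d in {1, ..., C}: candidate thresholds S_d = {1, ..., C}
    return list(range(1, C + 1))

def one_hot(value, C):
    # categorical: replace x_d in {1, ..., C} by C binary features x_{d,c},
    # each of which admits a single binary test x_{d,c} = 0 or 1
    return [1 if value == c else 0 for c in range(1, C + 1)]

print(splits_continuous([3.1, 2.0, 3.1]))  # [2.0, 3.1]
print(one_hot(2, 4))                       # [0, 1, 0, 0]
```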
Cost function
Objective: find a decision tree minimizing the cost function.
Regression cost, for predicting a constant w_k ∈ ℝ: the cost per region is the mean squared error (MSE)
cost(R_k, D) = (1/N_k) Σ_{x^(n) ∈ R_k} (y^(n) − w_k)²
where N_k is the number of instances in region k and w_k = mean(y^(n) | x^(n) ∈ R_k).
Classification cost, for predicting a constant class w_k ∈ {1, …, C}: the cost per region is the misclassification rate
cost(R_k, D) = (1/N_k) Σ_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)
where w_k = mode(y^(n) | x^(n) ∈ R_k).
The total cost in both cases is the normalized sum Σ_k (N_k/N) cost(R_k, D).
It is sometimes possible to build a tree with zero cost: build a large tree with each instance having its own region (overfitting!).
New objective: find a decision tree with K tests minimizing the cost function.
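The two per-region costs and the normalized total can be written as plain functions; a sketch (names are ours, not a library API):

```python
from statistics import mean, mode

def mse_cost(ys):
    # regression: squared deviation from the region constant w_k = mean
    w = mean(ys)
    return sum((y - w) ** 2 for y in ys) / len(ys)

def misclassification_cost(ys):
    # classification: fraction of labels differing from w_k = mode
    w = mode(ys)
    return sum(y != w for y in ys) / len(ys)

def total_cost(regions_ys, cost):
    # normalized sum over regions: sum_k (N_k / N) * cost(R_k, D)
    N = sum(len(ys) for ys in regions_ys)
    return sum(len(ys) / N * cost(ys) for ys in regions_ys)

print(misclassification_cost([1, 1, 0, 1]))  # 0.25
print(mse_cost([1.0, 3.0]))                  # 1.0
```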
Search space
Objective: find a decision tree with K tests (K+1 regions) minimizing the cost function. Alternatively, find the smallest tree (smallest K) that classifies all examples correctly.
[figure: an axis-aligned partition produced by a decision tree, next to a partition not produced by a decision tree]
Assuming D features, how many different partitions of size K+1 are there? The number of full binary trees with K+1 leaves (regions R_k) is the Catalan number (1/(K+1)) · C(2K, K), which is exponential in K:
1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, 6564120420, 24466267020, 91482563640, 343059613650, 1289904147324, 4861946401452, …
We also have a choice of feature x_d for each of the K internal nodes (D^K choices), and moreover, for each feature, different choices of the split s_{d,n} ∈ S_d.
Bottom line: finding the optimal decision tree is an NP-hard combinatorial optimization problem.
Greedy heuristic
Recursively split the regions based on a greedy choice of the next test; end the recursion if the split is not worth it.

function fittree(R_node, D, depth)
    if not worthsplitting(depth, R_node)
        return R_node
    else
        R_left, R_right = greedytest(R_node, D)
        leftset = fittree(R_left, D, depth+1)
        rightset = fittree(R_right, D, depth+1)
        return {leftset, rightset}

The final decision tree is a nested list of regions, e.g. {{R_1, R_2}, {R_3, {R_4, R_5}}}.
Choosing tests
The split is greedy because it looks only one step ahead; this may not lead to the lowest overall cost.

function greedytest(R_node, D)
    bestcost = inf
    for d ∈ {1, …, D}, s_{d,n} ∈ S_d
        # create the new regions
        R_left = R_node ∪ {x_d < s_{d,n}}
        R_right = R_node ∪ {x_d ≥ s_{d,n}}
        # evaluate their cost
        splitcost = (N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D)
        if splitcost < bestcost:
            bestcost = splitcost
            R_left*, R_right* = R_left, R_right
    # return the split with the lowest greedy cost
    return R_left*, R_right*
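The fittree/greedytest recursion can be made concrete for a 1-D classification problem with the misclassification cost. This is a minimal sketch under our own simplifications (a fixed depth limit and minimum region size as the worthsplitting rule), not the course's reference implementation:

```python
from statistics import mode

def cost(ys):
    # misclassification rate of a region with labels ys
    if not ys:
        return 0.0
    w = mode(ys)
    return sum(y != w for y in ys) / len(ys)

def greedytest(data):
    # data: list of (x, y); try every observed value as a threshold x < s
    best, bestcost = None, float("inf")
    for s in sorted({x for x, _ in data}):
        left = [(x, y) for x, y in data if x < s]
        right = [(x, y) for x, y in data if x >= s]
        if not left or not right:
            continue
        n = len(data)
        splitcost = (len(left) / n) * cost([y for _, y in left]) \
                  + (len(right) / n) * cost([y for _, y in right])
        if splitcost < bestcost:
            bestcost, best = splitcost, (s, left, right)
    return best

def fittree(data, depth=0, max_depth=3, min_size=2):
    ys = [y for _, y in data]
    split = greedytest(data)
    # worthsplitting heuristics: depth, region size, zero cost, no valid split
    if depth >= max_depth or len(data) < min_size or cost(ys) == 0 or split is None:
        return mode(ys)                      # leaf: predict the region constant
    s, left, right = split
    return (s, fittree(left, depth + 1, max_depth, min_size),
               fittree(right, depth + 1, max_depth, min_size))

tree = fittree([(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)])
print(tree)  # (2.0, 0, 1): split at x < 2.0, leaves predict 0 and 1
```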
Stopping the recursion (the worthsplitting subroutine)
If we only stop when R_node has zero cost, we may overfit.
Heuristics for stopping the splitting:
- we reached a desired depth
- the number of examples in R_left or R_right is too small
- w_k is a good approximation, i.e., the cost is small enough
- the reduction in cost from splitting is small:
cost(R_node, D) − ((N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D))
image credit: https://alanjeffares.wordpress.com/tutorials/decision-tree/
Revisiting the classification cost
Ideally we want to optimize the 0-1 loss (misclassification rate):
cost(R_k, D) = (1/N_k) Σ_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)
This may not be the optimal cost for each step of the greedy heuristic.
Example: a node R_node with label distribution (.5, 100%) can be split into R_left, R_right as (.25, 50%) and (.75, 50%), or as (.33, 75%) and (1, 25%); each pair shows (fraction of positive labels, percentage of the data).
Both splits have the same misclassification rate (2/8). However, the second split may be preferable, because one region does not need further splitting.
Idea: use a measure of the homogeneity of the labels in the regions.
Entropy
Entropy is the expected amount of information in observing a random variable y:
H(y) = − Σ_{c=1}^C p(y = c) log p(y = c)
(note that it is common to use capital letters for random variables; here, for consistency, we use lower-case)
−log p(y = c) is the amount of information in observing c:
- zero information if p(c) = 1
- less probable events are more informative: p(c) < p(c′) ⇒ −log p(c) > −log p(c′)
- information from two independent events is additive: −log(p(c)q(d)) = −log p(c) − log q(d)
A uniform distribution has the highest entropy: H(y) = −Σ_{c=1}^C (1/C) log(1/C) = log C.
A deterministic random variable has the lowest entropy: H(y) = −1 · log(1) = 0.
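The entropy formula and its two extreme cases can be checked directly (base-2 logs, so entropy is in bits):

```python
from math import log2

def entropy(p):
    # H(y) = -sum_c p(c) log2 p(c); terms with p(c) = 0 contribute nothing
    return -sum(pc * log2(pc) for pc in p if pc > 0)

print(entropy([0.5, 0.5]))  # 1.0: uniform over 2 classes, the maximum log2(C)
print(entropy([1.0]))       # 0.0: deterministic, the minimum
```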
Mutual information
For two random variables t and y, the mutual information is the amount of information t conveys about y: the change in the entropy of y after observing the value of t,
I(t, y) = H(y) − H(y|t)
where H(y|t) = Σ_{l=1}^L p(t = l) H(y | t = l) is the conditional entropy. Equivalently,
I(t, y) = Σ_l Σ_c p(y = c, t = l) log [ p(y = c, t = l) / (p(y = c) p(t = l)) ]
which is symmetric w.r.t. y and t: I(t, y) = H(t) − H(t|y) = I(y, t).
Mutual information is always non-negative, and zero only if y and t are independent (try to prove these properties).
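The symmetric form of I(t, y) can be computed from a joint probability table; a small sketch, where joint[l][c] is an assumed tabulation of p(t = l, y = c):

```python
from math import log2

def mutual_information(joint):
    # I(t, y) = sum_{l,c} p(y=c, t=l) * log2[ p(y=c, t=l) / (p(t=l) p(y=c)) ]
    pt = [sum(row) for row in joint]        # marginal p(t = l)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y = c)
    mi = 0.0
    for l, row in enumerate(joint):
        for c, p in enumerate(row):
            if p > 0:
                mi += p * log2(p / (pt[l] * py[c]))
    return mi

print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0: t determines y
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0: independent
```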
Entropy for classification cost
We care about the distribution of the labels in each region: p_k(y = c) = (1/N_k) Σ_{x^(n) ∈ R_k} I(y^(n) = c).
Misclassification cost: cost(R_k, D) = (1/N_k) Σ_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 − p_k(w_k), where w_k = argmax_c p_k(c) is the most probable class.
Entropy cost: cost(R_k, D) = H(y); choose the split with the lowest entropy.
With the entropy cost, the change in cost becomes the mutual information between the test and the labels:
cost(R_node, D) − ((N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D))
= H(y) − (p(x_d ≥ s_{d,n}) H(y | x_d ≥ s_{d,n}) + p(x_d < s_{d,n}) H(y | x_d < s_{d,n}))
= I(y, x_d ≥ s_{d,n})
so we are choosing the test which is maximally informative about the labels.
Entropy for classification cost

example: two candidate splits of the same node; each pair below is (p(y = 1), fraction of the data)

R_node: (.5, 100%)
split 1: R_left (.25, 50%)   R_right (.75, 50%)
split 2: R_left (.33, 75%)   R_right (1, 25%)

misclassification cost:
split 1: (4/8)·(1/4) + (4/8)·(1/4) = 1/4
split 2: (6/8)·(1/3) + (2/8)·0 = 1/4
the two splits have the same cost

entropy cost (using base 2 logarithm):
split 1: (4/8)( −(1/4)log(1/4) − (3/4)log(3/4) ) + (4/8)( −(1/4)log(1/4) − (3/4)log(3/4) ) ≈ .81
split 2: (6/8)( −(1/3)log(1/3) − (2/3)log(2/3) ) + (2/8)·0 ≈ .68
split 2 is the lower cost split

8 . 5
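The arithmetic in this example can be checked in a few lines of Python (an illustrative sketch; `split_cost` and the `(p_class1, fraction)` encoding are our own):

```python
import numpy as np

def entropy(p):
    """Binary entropy in bits; p is the fraction of class 1 in a region."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def misclass(p):
    """Error rate of predicting the majority class."""
    return min(p, 1 - p)

def split_cost(children, cost):
    """Weighted cost of a split; children = [(p_class1, fraction_of_data), ...]."""
    return sum(frac * cost(p) for p, frac in children)

split_1 = [(0.25, 0.5), (0.75, 0.5)]
split_2 = [(1 / 3, 0.75), (1.0, 0.25)]

print(split_cost(split_1, misclass), split_cost(split_2, misclass))  # both 0.25
print(split_cost(split_1, entropy), split_cost(split_2, entropy))    # ~0.81 vs ~0.69
```

Misclassification cost cannot tell the two splits apart, while entropy prefers split 2, matching the slide.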
Gini index

another cost for selecting the test in classification

misclassification (error) rate:  cost(R_k, D) = (1/N_k) Σ_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 − p(w_k)
entropy:  cost(R_k, D) = H(y)
Gini index:  cost(R_k, D) = Σ_{c=1}^C p(c)(1 − p(c)) = Σ_{c=1}^C p(c) − Σ_{c=1}^C p(c)² = 1 − Σ_{c=1}^C p(c)²

in each term, p(c) is the probability of class c and (1 − p(c)) is the probability of error,
so the Gini index is the expected error rate

(figure: comparison of the costs of a node when we have 2 classes, plotted against p(y = 1))

8 . 6

Winter 2020 | Applied Machine Learning (COMP551)
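The comparison in that figure is easy to recompute; a small sketch (our own helper name `node_costs`) of the three impurities for a two-class node:

```python
import numpy as np

def node_costs(p):
    """Misclassification rate, entropy (bits), and Gini index of a binary
    node, as functions of p = p(y = 1); p may be a NumPy array."""
    p = np.asarray(p, dtype=float)
    error = np.minimum(p, 1 - p)
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    entropy = np.nan_to_num(h)            # convention: 0 * log 0 = 0
    gini = 2 * p * (1 - p)                # = 1 - p**2 - (1 - p)**2
    return error, entropy, gini
```

All three vanish at p ∈ {0, 1} and peak at p = 1/2; entropy and Gini are strictly concave, which is what lets them distinguish splits that the piecewise-linear misclassification rate cannot.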
Example

decision tree for the Iris dataset (D = 2)

(figure: dataset, decision boundaries after splits 1, 2, 3, and the resulting decision tree)

the decision boundaries suggest overfitting, confirmed using a validation set:
training accuracy ~ 85%
(cross-)validation accuracy ~ 70%

9 . 1
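The slides do not say which implementation produced these plots; a rough way to reproduce the same qualitative train/validation gap with scikit-learn (an assumption on our part), keeping D = 2 features, is sketched below. The exact accuracies depend on the split and will generally differ from the ~85% / ~70% quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, 2:]                                   # keep two features (petal length/width)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0)  # fully grown tree
tree.fit(X_tr, y_tr)
print("training accuracy:  ", tree.score(X_tr, y_tr))
print("validation accuracy:", tree.score(X_va, y_va))
```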
Overfitting

a decision tree can fit any Boolean function (binary classification with binary features)

example: decision tree representation of a Boolean function (D = 3)
image credit: https://www.wikiwand.com/en/Binary_decision_diagram

there are 2^(2^D) such functions. why? a Boolean function assigns one of 2 labels to each of the 2^D possible binary inputs

large decision trees have high variance and low bias (low training error, high test error)

idea 1. grow a small tree

problem: a substantial reduction in cost may only happen after a few more steps,
and by stopping early we cannot know this
example: the cost drops only after the second node

9 . 2
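XOR is the textbook case of this failure mode (our illustration; the slide's figure may use a different example): no single test reduces the entropy cost, yet one more level of splits classifies perfectly.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# XOR: the label is 1 iff exactly one of the two binary features is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# splitting on either feature alone leaves both children at 50/50: zero gain
for d in (0, 1):
    left, right = y[X[:, d] == 0], y[X[:, d] == 1]
    gain = entropy(y) - 0.5 * entropy(left) - 0.5 * entropy(right)
    print(f"gain of the first split on x{d}: {gain}")   # 0.0

# but after splitting on x0 and then on x1, every leaf is pure
leaves = [y[(X[:, 0] == a) & (X[:, 1] == b)] for a in (0, 1) for b in (0, 1)]
print("leaf entropies:", [entropy(l) for l in leaves])
```

A stopping rule of the form "stop when no split improves the cost" would never get past the root here.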
Pruning

idea 2. grow a large tree and then prune it

greedily turn an internal node into a leaf node
the choice is based on the lowest increase in the cost
repeat this until left with the root node
pick the best among the above models using a validation set

example: before pruning / after pruning
cross-validation is used to pick the best size

idea 3. random forests (later!)

9 . 3
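This greedy collapse-and-validate loop is closely related to cost-complexity (weakest-link) pruning, which scikit-learn's CART implementation exposes directly; a sketch under that assumption (dataset and seed are illustrative, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# enumerate the sequence of pruned subtrees, from the full tree down to the root
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# refit at each pruning level and keep the tree that validates best
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_va, y_va),
)
print("leaves:", best.get_n_leaves(), " validation accuracy:", best.score(X_va, y_va))
```

Since ccp_alpha = 0 (the unpruned tree) is among the candidates, the selected tree can only match or beat the full tree on the validation set.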
Summary

model: divide the input into axis-aligned regions
cost: for regression and classification
optimization:
NP-hard
use a greedy heuristic
adjust the cost for the heuristic:
using entropy (relation to mutual information maximization)
using the Gini index
decision trees are unstable (have high variance)
use pruning to avoid overfitting
there are variations on decision tree heuristics; what we discussed is called Classification and Regression Trees (CART)

10