Applied Machine Learning: Decision Trees
Siamak Ravanbakhsh
COMP 551 (Winter 2020)
Learning objectives
decision trees:
- model
- cost function
- how it is optimized
how to grow a tree and why you should prune it!
Adaptive bases
So far we assumed a fixed set of bases in f(x) = Σ_d w_d φ_d(x).
Several methods can be classified as learning these bases adaptively:
decision trees, generalized additive models, boosting, neural networks
f(x) = Σ_d w_d φ_d(x; v_d), where each basis has its own parameters v_d.
Decision trees: motivation
Pros: decision trees are interpretable; they are not very sensitive to outliers; they do not need data normalization.
Cons: they can easily overfit and they are unstable. Remedies: pruning, random forests.
image credit: https://mymodernmet.com/the-30second-rule-a-decision/
Decision trees: idea
Divide the input space into regions and learn one function per region:
f(x) = Σ_k w_k I(x ∈ R_k)
- the regions are learned adaptively
- more sophisticated prediction per region is also possible (e.g., one linear model per region)
Split the regions successively based on the value of a single variable, called a test.
Each region is a set of conditions, e.g. R_2 = {x_1 ≤ t_1, x_2 ≤ t_4}.
[figure: a partition of the (x_1, x_2) plane into regions, each with its own constant prediction w_k]
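The piecewise-constant model above can be sketched in a few lines. This is a minimal illustration, not the lecture's implementation; the regions, thresholds, and constants below are made-up examples.

```python
# Sketch of the decision-tree prediction f(x) = sum_k w_k * I(x in R_k):
# each region R_k is a list of conditions, and prediction returns the
# constant w_k of the region containing x.

def predict(x, regions):
    """Return the constant w_k of the first region whose conditions x satisfies."""
    for conditions, w in regions:
        if all(cond(x) for cond in conditions):
            return w
    raise ValueError("x falls in no region")

# Hypothetical example: R_2 = {x_1 <= t_1, x_2 <= t_4} with made-up
# thresholds t_1 = 2.0, t_4 = 5.0, and a catch-all region for the rest.
regions = [
    ([lambda x: x[0] <= 2.0, lambda x: x[1] <= 5.0], 1.5),  # R_2, w_2 = 1.5
    ([lambda x: True], 0.0),                                 # everything else
]

print(predict((1.0, 3.0), regions))  # point inside R_2 -> 1.5
```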
Prediction per region
Suppose we have identified the regions R_k; what constant w_k should we use for prediction in each region?
For regression: use the mean value of the training data points in that region:
w_k = mean(y^(n) | x^(n) ∈ R_k)
For classification: count the frequency of the classes per region and predict the most frequent label (or return a probability):
w_k = mode(y^(n) | x^(n) ∈ R_k)
Example: predicting survival on the Titanic. Each leaf shows the most frequent label, the frequency of survival, and the percentage of the training data in that region.
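The two per-region constants can be computed directly with the standard library; a small sketch (function names are ours):

```python
from statistics import mean, mode

# Given the labels of the points already assigned to a region R_k:
# the regression constant is the mean label, the classification
# constant is the most frequent label (the mode).

def region_constant_regression(ys):
    return mean(ys)               # w_k = mean(y^(n) | x^(n) in R_k)

def region_constant_classification(ys):
    return mode(ys)               # w_k = mode(y^(n) | x^(n) in R_k)

print(region_constant_regression([1.0, 2.0, 3.0]))      # 2.0
print(region_constant_classification([0, 1, 1, 1, 0]))  # 1
```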
Feature types
Given a feature, what are the possible tests?
Continuous features (e.g., age, height, GDP): all the values that appear in the dataset can be used to split, S_d = {s_{d,n} = x_d^(n)}, one set of possible splits for each feature d. Each split asks: x_d > s_{d,n}?
Ordinal features (e.g., grade, rating), with x_d ∈ {1, …, C}: we can split at any value, so S_d = {s_{d,1} = 1, …, s_{d,C} = C}. Each split asks: x_d > s_{d,c}?
Categorical features (types, classes, and categories):
- multi-way split {x_d = 1, …, x_d = C}. Problem: it can lead to sparse subsets; data fragmentation means some splits may have few or no data points.
- binary split: assume C binary features (one-hot coding), so instead of x_d ∈ {1, …, C} we have x_{d,1} ∈ {0, 1}, …, x_{d,C} ∈ {0, 1}, and each split asks x_{d,c} = 0 or x_{d,c} = 1.
- alternative: binary splits that produce balanced subsets.
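The candidate tests per feature type can be enumerated as follows; a sketch with our own function names, assuming the conventions above:

```python
# Candidate splits per feature type.

def splits_continuous(values):
    # every observed value is a candidate threshold: S_d = {s_{d,n} = x_d^(n)}
    return sorted(set(values))

def splits_ordinal(C):
    # x_d in {1, ..., C}: candidate thresholds S_d = {1, ..., C}
    return list(range(1, C + 1))

def one_hot(value, C):
    # categorical: replace x_d in {1, ..., C} by C binary features x_{d,c},
    # each of which admits a single binary test x_{d,c} = 0 or 1
    return [1 if value == c else 0 for c in range(1, C + 1)]

print(splits_continuous([3.1, 2.0, 3.1]))  # [2.0, 3.1]
print(one_hot(2, 4))                       # [0, 1, 0, 0]
```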
Cost function
Objective: find a decision tree minimizing the cost function.
Regression cost, for predicting a constant w_k ∈ ℝ: the cost per region is the mean squared error (MSE)
cost(R_k, D) = (1/N_k) Σ_{x^(n) ∈ R_k} (y^(n) − w_k)²
where N_k is the number of instances in region k and w_k = mean(y^(n) | x^(n) ∈ R_k).
Classification cost, for predicting a constant class w_k ∈ {1, …, C}: the cost per region is the misclassification rate
cost(R_k, D) = (1/N_k) Σ_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)
where w_k = mode(y^(n) | x^(n) ∈ R_k).
The total cost in both cases is the normalized sum Σ_k (N_k/N) cost(R_k, D).
It is sometimes possible to build a tree with zero cost: build a large tree with each instance having its own region (overfitting!).
New objective: find a decision tree with K tests minimizing the cost function.
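The two per-region costs and the normalized total can be written as plain functions; a sketch (names are ours, not a library API):

```python
from statistics import mean, mode

def mse_cost(ys):
    # regression: squared deviation from the region constant w_k = mean
    w = mean(ys)
    return sum((y - w) ** 2 for y in ys) / len(ys)

def misclassification_cost(ys):
    # classification: fraction of labels differing from w_k = mode
    w = mode(ys)
    return sum(y != w for y in ys) / len(ys)

def total_cost(regions_ys, cost):
    # normalized sum over regions: sum_k (N_k / N) * cost(R_k, D)
    N = sum(len(ys) for ys in regions_ys)
    return sum(len(ys) / N * cost(ys) for ys in regions_ys)

print(misclassification_cost([1, 1, 0, 1]))  # 0.25
print(mse_cost([1.0, 3.0]))                  # 1.0
```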
Search space
Objective: find a decision tree with K tests (K+1 regions) minimizing the cost function. Alternatively, find the smallest tree (smallest K) that classifies all examples correctly.
[figure: an axis-aligned partition produced by a decision tree, next to a partition not produced by a decision tree]
Assuming D features, how many different partitions of size K+1 are there? The number of full binary trees with K+1 leaves (regions R_k) is the Catalan number (1/(K+1)) · C(2K, K), which is exponential in K:
1, 1, 2, 5, 14, 42, 132, 429, 1430, 4862, 16796, 58786, 208012, 742900, 2674440, 9694845, 35357670, 129644790, 477638700, 1767263190, 6564120420, 24466267020, 91482563640, 343059613650, 1289904147324, 4861946401452, …
We also have a choice of feature x_d for each of the K internal nodes (D^K choices), and moreover, for each feature, different choices of the split s_{d,n} ∈ S_d.
Bottom line: finding the optimal decision tree is an NP-hard combinatorial optimization problem.
Greedy heuristic
Recursively split the regions based on a greedy choice of the next test; end the recursion if the split is not worth it.

function fittree(R_node, D, depth)
    if not worthsplitting(depth, R_node)
        return R_node
    else
        R_left, R_right = greedytest(R_node, D)
        leftset = fittree(R_left, D, depth+1)
        rightset = fittree(R_right, D, depth+1)
        return {leftset, rightset}

The final decision tree is a nested list of regions, e.g. {{R_1, R_2}, {R_3, {R_4, R_5}}}.
Choosing tests
The split is greedy because it looks only one step ahead; this may not lead to the lowest overall cost.

function greedytest(R_node, D)
    bestcost = inf
    for d ∈ {1, …, D}, s_{d,n} ∈ S_d
        # create the new regions
        R_left = R_node ∪ {x_d < s_{d,n}}
        R_right = R_node ∪ {x_d ≥ s_{d,n}}
        # evaluate their cost
        splitcost = (N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D)
        if splitcost < bestcost:
            bestcost = splitcost
            R_left*, R_right* = R_left, R_right
    # return the split with the lowest greedy cost
    return R_left*, R_right*
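The fittree/greedytest recursion can be made concrete for a 1-D classification problem with the misclassification cost. This is a minimal sketch under our own simplifications (a fixed depth limit and minimum region size as the worthsplitting rule), not the course's reference implementation:

```python
from statistics import mode

def cost(ys):
    # misclassification rate of a region with labels ys
    if not ys:
        return 0.0
    w = mode(ys)
    return sum(y != w for y in ys) / len(ys)

def greedytest(data):
    # data: list of (x, y); try every observed value as a threshold x < s
    best, bestcost = None, float("inf")
    for s in sorted({x for x, _ in data}):
        left = [(x, y) for x, y in data if x < s]
        right = [(x, y) for x, y in data if x >= s]
        if not left or not right:
            continue
        n = len(data)
        splitcost = (len(left) / n) * cost([y for _, y in left]) \
                  + (len(right) / n) * cost([y for _, y in right])
        if splitcost < bestcost:
            bestcost, best = splitcost, (s, left, right)
    return best

def fittree(data, depth=0, max_depth=3, min_size=2):
    ys = [y for _, y in data]
    split = greedytest(data)
    # worthsplitting heuristics: depth, region size, zero cost, no valid split
    if depth >= max_depth or len(data) < min_size or cost(ys) == 0 or split is None:
        return mode(ys)                      # leaf: predict the region constant
    s, left, right = split
    return (s, fittree(left, depth + 1, max_depth, min_size),
               fittree(right, depth + 1, max_depth, min_size))

tree = fittree([(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1)])
print(tree)  # (2.0, 0, 1): split at x < 2.0, leaves predict 0 and 1
```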
Stopping the recursion (the worthsplitting subroutine)
If we only stop when R_node has zero cost, we may overfit.
Heuristics for stopping the splitting:
- we reached a desired depth
- the number of examples in R_left or R_right is too small
- w_k is a good approximation, i.e., the cost is small enough
- the reduction in cost from splitting is small:
cost(R_node, D) − ((N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D))
image credit: https://alanjeffares.wordpress.com/tutorials/decision-tree/
Revisiting the classification cost
Ideally we want to optimize the 0-1 loss (misclassification rate):
cost(R_k, D) = (1/N_k) Σ_{x^(n) ∈ R_k} I(y^(n) ≠ w_k)
This may not be the optimal cost for each step of the greedy heuristic.
Example: a node R_node with label distribution (.5, 100%) can be split into R_left, R_right as (.25, 50%) and (.75, 50%), or as (.33, 75%) and (1, 25%); each pair shows (fraction of positive labels, percentage of the data).
Both splits have the same misclassification rate (2/8). However, the second split may be preferable, because one region does not need further splitting.
Idea: use a measure of the homogeneity of the labels in the regions.
Entropy
Entropy is the expected amount of information in observing a random variable y:
H(y) = − Σ_{c=1}^C p(y = c) log p(y = c)
(note that it is common to use capital letters for random variables; here, for consistency, we use lower-case)
−log p(y = c) is the amount of information in observing c:
- zero information if p(c) = 1
- less probable events are more informative: p(c) < p(c′) ⇒ −log p(c) > −log p(c′)
- information from two independent events is additive: −log(p(c)q(d)) = −log p(c) − log q(d)
A uniform distribution has the highest entropy: H(y) = −Σ_{c=1}^C (1/C) log(1/C) = log C.
A deterministic random variable has the lowest entropy: H(y) = −1 · log(1) = 0.
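The entropy formula and its two extreme cases can be checked directly (base-2 logs, so entropy is in bits):

```python
from math import log2

def entropy(p):
    # H(y) = -sum_c p(c) log2 p(c); terms with p(c) = 0 contribute nothing
    return -sum(pc * log2(pc) for pc in p if pc > 0)

print(entropy([0.5, 0.5]))  # 1.0: uniform over 2 classes, the maximum log2(C)
print(entropy([1.0]))       # 0.0: deterministic, the minimum
```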
Mutual information
For two random variables t and y, the mutual information is the amount of information t conveys about y: the change in the entropy of y after observing the value of t,
I(t, y) = H(y) − H(y|t)
where H(y|t) = Σ_{l=1}^L p(t = l) H(y | t = l) is the conditional entropy. Equivalently,
I(t, y) = Σ_l Σ_c p(y = c, t = l) log [ p(y = c, t = l) / (p(y = c) p(t = l)) ]
which is symmetric w.r.t. y and t: I(t, y) = H(t) − H(t|y) = I(y, t).
Mutual information is always non-negative, and zero only if y and t are independent (try to prove these properties).
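The symmetric form of I(t, y) can be computed from a joint probability table; a small sketch, where joint[l][c] is an assumed tabulation of p(t = l, y = c):

```python
from math import log2

def mutual_information(joint):
    # I(t, y) = sum_{l,c} p(y=c, t=l) * log2[ p(y=c, t=l) / (p(t=l) p(y=c)) ]
    pt = [sum(row) for row in joint]        # marginal p(t = l)
    py = [sum(col) for col in zip(*joint)]  # marginal p(y = c)
    mi = 0.0
    for l, row in enumerate(joint):
        for c, p in enumerate(row):
            if p > 0:
                mi += p * log2(p / (pt[l] * py[c]))
    return mi

print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0: t determines y
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0: independent
```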
Entropy for classification cost
We care about the distribution of the labels in each region: p_k(y = c) = (1/N_k) Σ_{x^(n) ∈ R_k} I(y^(n) = c).
Misclassification cost: cost(R_k, D) = (1/N_k) Σ_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 − p_k(w_k), where w_k = argmax_c p_k(c) is the most probable class.
Entropy cost: cost(R_k, D) = H(y); choose the split with the lowest entropy.
With the entropy cost, the change in cost becomes the mutual information between the test and the labels:
cost(R_node, D) − ((N_left/N_node) cost(R_left, D) + (N_right/N_node) cost(R_right, D))
= H(y) − (p(x_d ≥ s_{d,n}) H(y | x_d ≥ s_{d,n}) + p(x_d < s_{d,n}) H(y | x_d < s_{d,n}))
= I(y, x_d ≥ s_{d,n})
so we are choosing the test which is maximally informative about the labels.
Entropy for classification cost

example: two candidate splits of the same node; each pair below is (p(y = 1), fraction of the data)

R_node: (.5, 100%)
split 1: R_left (.25, 50%)   R_right (.75, 50%)
split 2: R_left (.33, 75%)   R_right (1, 25%)

misclassification cost:
split 1: (4/8)·(1/4) + (4/8)·(1/4) = 1/4
split 2: (6/8)·(1/3) + (2/8)·0 = 1/4
the two splits have the same cost

entropy cost (using base 2 logarithm):
split 1: (4/8)( −(1/4)log(1/4) − (3/4)log(3/4) ) + (4/8)( −(1/4)log(1/4) − (3/4)log(3/4) ) ≈ .81
split 2: (6/8)( −(1/3)log(1/3) − (2/3)log(2/3) ) + (2/8)·0 ≈ .68
split 2 is the lower cost split

8 . 5
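The arithmetic in this example can be checked in a few lines of Python (an illustrative sketch; `split_cost` and the `(p_class1, fraction)` encoding are our own):

```python
import numpy as np

def entropy(p):
    """Binary entropy in bits; p is the fraction of class 1 in a region."""
    if p == 0.0 or p == 1.0:
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def misclass(p):
    """Error rate of predicting the majority class."""
    return min(p, 1 - p)

def split_cost(children, cost):
    """Weighted cost of a split; children = [(p_class1, fraction_of_data), ...]."""
    return sum(frac * cost(p) for p, frac in children)

split_1 = [(0.25, 0.5), (0.75, 0.5)]
split_2 = [(1 / 3, 0.75), (1.0, 0.25)]

print(split_cost(split_1, misclass), split_cost(split_2, misclass))  # both 0.25
print(split_cost(split_1, entropy), split_cost(split_2, entropy))    # ~0.81 vs ~0.69
```

Misclassification cost cannot tell the two splits apart, while entropy prefers split 2, matching the slide.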
Gini index

another cost for selecting the test in classification

misclassification (error) rate:  cost(R_k, D) = (1/N_k) Σ_{x^(n) ∈ R_k} I(y^(n) ≠ w_k) = 1 − p(w_k)
entropy:  cost(R_k, D) = H(y)
Gini index:  cost(R_k, D) = Σ_{c=1}^C p(c)(1 − p(c)) = Σ_{c=1}^C p(c) − Σ_{c=1}^C p(c)² = 1 − Σ_{c=1}^C p(c)²

in each term, p(c) is the probability of class c and (1 − p(c)) is the probability of error,
so the Gini index is the expected error rate

(figure: comparison of the costs of a node when we have 2 classes, plotted against p(y = 1))

8 . 6

Winter 2020 | Applied Machine Learning (COMP551)
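The comparison in that figure is easy to recompute; a small sketch (our own helper name `node_costs`) of the three impurities for a two-class node:

```python
import numpy as np

def node_costs(p):
    """Misclassification rate, entropy (bits), and Gini index of a binary
    node, as functions of p = p(y = 1); p may be a NumPy array."""
    p = np.asarray(p, dtype=float)
    error = np.minimum(p, 1 - p)
    with np.errstate(divide="ignore", invalid="ignore"):
        h = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    entropy = np.nan_to_num(h)            # convention: 0 * log 0 = 0
    gini = 2 * p * (1 - p)                # = 1 - p**2 - (1 - p)**2
    return error, entropy, gini
```

All three vanish at p ∈ {0, 1} and peak at p = 1/2; entropy and Gini are strictly concave, which is what lets them distinguish splits that the piecewise-linear misclassification rate cannot.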
Example

decision tree for the Iris dataset (D = 2)

(figure: dataset, decision boundaries after splits 1, 2, 3, and the resulting decision tree)

the decision boundaries suggest overfitting, confirmed using a validation set:
training accuracy ~ 85%
(cross-)validation accuracy ~ 70%

9 . 1
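The slides do not say which implementation produced these plots; a rough way to reproduce the same qualitative train/validation gap with scikit-learn (an assumption on our part), keeping D = 2 features, is sketched below. The exact accuracies depend on the split and will generally differ from the ~85% / ~70% quoted above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X = X[:, 2:]                                   # keep two features (petal length/width)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0)  # fully grown tree
tree.fit(X_tr, y_tr)
print("training accuracy:  ", tree.score(X_tr, y_tr))
print("validation accuracy:", tree.score(X_va, y_va))
```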
Overfitting

a decision tree can fit any Boolean function (binary classification with binary features)

example: decision tree representation of a Boolean function (D = 3)
image credit: https://www.wikiwand.com/en/Binary_decision_diagram

there are 2^(2^D) such functions. why? a Boolean function assigns one of 2 labels to each of the 2^D possible binary inputs

large decision trees have high variance and low bias (low training error, high test error)

idea 1. grow a small tree

problem: a substantial reduction in cost may only happen after a few more steps,
and by stopping early we cannot know this
example: the cost drops only after the second node

9 . 2
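XOR is the textbook case of this failure mode (our illustration; the slide's figure may use a different example): no single test reduces the entropy cost, yet one more level of splits classifies perfectly.

```python
import numpy as np

def entropy(y):
    """Shannon entropy (base 2) of a vector of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# XOR: the label is 1 iff exactly one of the two binary features is 1
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# splitting on either feature alone leaves both children at 50/50: zero gain
for d in (0, 1):
    left, right = y[X[:, d] == 0], y[X[:, d] == 1]
    gain = entropy(y) - 0.5 * entropy(left) - 0.5 * entropy(right)
    print(f"gain of the first split on x{d}: {gain}")   # 0.0

# but after splitting on x0 and then on x1, every leaf is pure
leaves = [y[(X[:, 0] == a) & (X[:, 1] == b)] for a in (0, 1) for b in (0, 1)]
print("leaf entropies:", [entropy(l) for l in leaves])
```

A stopping rule of the form "stop when no split improves the cost" would never get past the root here.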
Pruning

idea 2. grow a large tree and then prune it

greedily turn an internal node into a leaf node
the choice is based on the lowest increase in the cost
repeat this until left with the root node
pick the best among the above models using a validation set

example: before pruning / after pruning
cross-validation is used to pick the best size

idea 3. random forests (later!)

9 . 3
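This greedy collapse-and-validate loop is closely related to cost-complexity (weakest-link) pruning, which scikit-learn's CART implementation exposes directly; a sketch under that assumption (dataset and seed are illustrative, not from the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

# enumerate the sequence of pruned subtrees, from the full tree down to the root
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# refit at each pruning level and keep the tree that validates best
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_va, y_va),
)
print("leaves:", best.get_n_leaves(), " validation accuracy:", best.score(X_va, y_va))
```

Since ccp_alpha = 0 (the unpruned tree) is among the candidates, the selected tree can only match or beat the full tree on the validation set.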
Summary

model: divide the input into axis-aligned regions
cost: for regression and classification
optimization:
NP-hard
use a greedy heuristic
adjust the cost for the heuristic:
using entropy (relation to mutual information maximization)
using the Gini index
decision trees are unstable (have high variance)
use pruning to avoid overfitting
there are variations on decision tree heuristics; what we discussed is called Classification and Regression Trees (CART)

10