Classification: Basic Concepts, Decision Trees, and Model Evaluation

Qi Liu (刘淇)
School of Computer Science and Technology, USTC
http://staff.ustc.edu.cn/~qiliuql/DM2013.html
Classification: Definition

• Given a collection of records (training set)
  Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
    x: attribute, predictor, independent variable, input
    y: class, response, dependent variable, output

• Task: learn a model that maps each attribute set x into one of the predefined class labels y

Example training set:

  ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
   1  Yes         Single          125K           No
   2  No          Married         100K           No
   3  No          Single          70K            No
   4  Yes         Married         120K           No
   5  No          Divorced        95K            Yes
   6  No          Married         60K            No
   7  Yes         Divorced        220K           No
   8  No          Single          85K            Yes
   9  No          Married         75K            No
  10  No          Single          90K            Yes
Examples of Classification Task

  Task                                 Attribute set, x                          Class label, y
  Decide to mail a catalog or not      Demographic information for households    Purchase or no purchase
  Customer churn prediction            Usage data for phone users                Churn or non-churn
  Decide to issue a credit card or not Application data                          Good credit or bad credit
  Categorizing email messages
  or web pages                         Words in the document                     Spam or non-spam
General Approach for Building Classification Model

Model Training:

  Tid  Attrib1  Attrib2  Attrib3  Class
   1   Yes      Large    125K     No
   2   No       Medium   100K     No
   3   No       Small    70K      No
   4   Yes      Medium   120K     No
   5   No       Large    95K      Yes
   6   No       Medium   60K      No
   7   Yes      Large    220K     No
   8   No       Small    85K      Yes
   9   No       Medium   75K      No
  10   No       Small    90K      Yes

  Training Set -> Learn Model
General Approach for Building Classification Model

Model Testing:

  Training set (Tid 1-10, as above) -> Learn Model

  Test set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

  Apply Model to the test set
Performance Evaluation

Confusion matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes    a (TP)      b (FN)
  CLASS    Class=No     c (FP)      d (TN)

Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
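A minimal sketch of the accuracy metric above (not from the slides; the function name and the example counts are hypothetical):

  def accuracy(tp, fn, fp, tn):
      """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
      return (tp + tn) / (tp + tn + fp + fn)

  # Hypothetical counts: 50 TP, 10 FN, 5 FP, 35 TN -> accuracy = 85/100 = 0.85
  print(accuracy(tp=50, fn=10, fp=5, tn=35))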
Classification Techniques

• Base Classifiers
  Decision Tree based Methods
  Rule-based Methods
  Nearest-neighbor
  Neural Networks
  Naïve Bayes and Bayesian Belief Networks
  Support Vector Machines

• Ensemble Classifiers
  Boosting, Bagging, Random Forests
Decision Trees

• Examples and Introduction
• Usage of Decision Tree
• Decision Tree Induction
• ……
Example of a Decision Tree Built

Training data: the 10-record loan table above (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower).

Model: decision tree (the splitting attributes appear at the internal nodes):

  Home Owner?
    Yes -> NO
    No  -> Marital Status?
             Married          -> NO
             Single, Divorced -> Annual Income?
                                   < 80K  -> NO
                                   >= 80K -> YES
Concepts of The Tree Structure

The same tree, viewed as a structure of a root node, internal nodes, and leaf nodes:

  Home Owner?
    Yes -> NO
    No  -> Marital Status?
             Married          -> NO
             Single, Divorced -> Annual Income?
                                   < 80K  -> NO
                                   >= 80K -> YES
Another Example of Decision Tree Built

Training data: the same 10-record loan table.

  Marital Status?
    Married          -> NO
    Single, Divorced -> Home Owner?
                          Yes -> NO
                          No  -> Annual Income?
                                   < 80K  -> NO
                                   >= 80K -> YES

There can be more than one tree that fits the same data!
Using Decision Tree for Classification

  Training set (Tid 1-10)             -> Learn Model  -> Decision Tree
  Test set (Tid 11-15, class unknown) -> Apply Model (the decision tree) to assign class labels
Applying Model to Test Data

Test record:

  Home Owner  Marital Status  Annual Income  Defaulted Borrower
  No          Married         80K            ?

Start from the root of the tree and, at each test condition, follow the branch that matches the record:

  Home Owner?      record has Home Owner = No          -> follow the "No" branch to Marital Status?
  Marital Status?  record has Marital Status = Married -> follow the "Married" branch to a leaf labeled NO

The record reaches the leaf NO, so assign Defaulted Borrower = "No" to the test record.
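A small sketch (not from the slides) of how this traversal can be coded; the dictionary-based tree encoding, the classify function, and the income threshold key are my own illustration:

  # Hand-coded version of the example tree: internal nodes test one attribute,
  # leaves carry a class label for "Defaulted Borrower".
  tree = ("Home Owner", {
      "Yes": "No",
      "No": ("Marital Status", {
          "Married": "No",
          "Single": ("Annual Income < 80K", {True: "No", False: "Yes"}),
          "Divorced": ("Annual Income < 80K", {True: "No", False: "Yes"}),
      }),
  })

  def classify(record, node):
      """Walk the tree from the root until a leaf (a plain string) is reached."""
      while not isinstance(node, str):
          attribute, branches = node
          if attribute == "Annual Income < 80K":
              key = record["Annual Income"] < 80
          else:
              key = record[attribute]
          node = branches[key]
      return node

  test_record = {"Home Owner": "No", "Marital Status": "Married", "Annual Income": 80}
  print(classify(test_record, tree))   # -> "No"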
How to Build a Decision Tree?

  Training set (Tid 1-10)  -> Learn Model -> Decision Tree
  Test set (Tid 11-15)     -> Apply Model
Decision Tree Induction

• How to build a decision tree from a data table
• Famous algorithms:
  Hunt's Algorithm (one of the earliest)
  CART
  ID3, C4.5
  SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t (training data: the 10-record loan table).

• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
• If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
• Recursively apply the procedure to each subset.
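A compact sketch of the recursive procedure just described, under simplifying assumptions of my own (categorical attributes only, a majority-class leaf when no attribute is left, and a placeholder choose_best_attribute; the real selection criteria, Gini and information gain, appear later in the slides):

  from collections import Counter

  def hunt(records, attributes, target="class"):
      """Recursive skeleton of Hunt's algorithm (simplified: categorical splits only)."""
      labels = [r[target] for r in records]
      # Case 1: all records belong to the same class -> leaf node
      if len(set(labels)) == 1:
          return labels[0]
      # Degenerate case: no attribute left to test -> majority-class leaf
      if not attributes:
          return Counter(labels).most_common(1)[0][0]
      # Case 2: mixed classes -> pick an attribute test and split
      attr = choose_best_attribute(records, attributes, target)
      children = {}
      for value in set(r[attr] for r in records):
          subset = [r for r in records if r[attr] == value]
          children[value] = hunt(subset, [a for a in attributes if a != attr], target)
      return (attr, children)

  def choose_best_attribute(records, attributes, target):
      # Placeholder: simply take the first candidate attribute.
      return attributes[0]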
Hunt's Algorithm

Applied to the 10-record loan table. (Figure: the tree is grown step by step, starting from a single node and repeatedly splitting the impure nodes until the final tree is obtained.)
Design Issues of Decision Tree Induction

• How should training records be split?
  Method for specifying the test condition, depending on attribute types
  Measure for evaluating the goodness of a test condition

• How should the splitting procedure stop?
  Stop splitting if all the records belong to the same class or all the records have identical attribute values
  Early termination
Methods for Expressing Test Conditions

• Depends on attribute types
  Binary
  Nominal
  Ordinal
  Continuous

• Depends on the number of ways to split
  2-way split
  Multi-way split
Test Condition for Nominal Attributes

• Multi-way split: use as many partitions as distinct values
• Binary split: divides the values into two subsets at a time; need to find the optimal partitioning
Test Condition for Ordinal Attributes

• Multi-way split: use as many partitions as distinct values
• Binary split: divides the values into two subsets; need to find the optimal partitioning
  Must preserve the order property among attribute values (the figure shows a grouping that violates the order property)
Test Condition for Continuous Attributes

(Figure: a binary split of the form A < v vs. a multi-way split into value ranges.)
Splitting Based on Continuous Attributes

• Different ways of handling
  Discretization to form an ordinal categorical attribute
    Static – discretize once at the beginning
    Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  Binary decision: (A < v) or (A ≥ v)
    Consider all possible splits and find the best cut
    Can be more compute-intensive
How to determine the Best Split

Before splitting: 10 records of class 0, 10 records of class 1

Which test condition is the best?
How to determine the Best Split

• Greedy approach: nodes with a purer class distribution are preferred
• Need a measure of node impurity (high degree of impurity vs. low degree of impurity)
Measures of Node Impurity

• Gini Index
  GINI(t) = 1 - \sum_j [p(j|t)]^2

• Entropy
  Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

• Misclassification error
  Error(t) = 1 - \max_i P(i|t)
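A small sketch (function name is mine) that computes all three measures from a node's class counts; it reproduces the single-node examples used in the following slides:

  from math import log2

  def impurities(counts):
      """Gini, entropy and classification error for one node, given class counts,
      e.g. counts = [3, 3] means 3 records of class C1 and 3 of class C2."""
      n = sum(counts)
      probs = [c / n for c in counts]
      gini = 1 - sum(p ** 2 for p in probs)
      entropy = -sum(p * log2(p) for p in probs if p > 0)
      error = 1 - max(probs)
      return gini, entropy, error

  print(impurities([0, 6]))   # (0.0, 0.0, 0.0)
  print(impurities([1, 5]))   # (~0.278, ~0.65, ~0.167)
  print(impurities([2, 4]))   # (~0.444, ~0.92, ~0.333)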
Finding the Best Split
1. Compute impurity measure (P) before splitting2 Compute impurity measure (M) after splitting2. Compute impurity measure (M) after splitting
Compute impurity measure of each child nodeCompute the average impurity of the children (M)Compute the average impurity of the children (M)
3. Choose the attribute test condition that produces the highest gainthe highest gain
Gain = P – M
or equivalently, lowest impurity measure after splitting (M)splitting (M)
32
Finding the Best Split

Before splitting: class counts (C0: N00, C1: N01)  ->  impurity P

  Split on A?                              Split on B?
    Yes -> Node N1 (C0: N10, C1: N11)        Yes -> Node N3 (C0: N30, C1: N31)
    No  -> Node N2 (C0: N20, C1: N21)        No  -> Node N4 (C0: N40, C1: N41)
  Child impurities M11, M12                Child impurities M21, M22
  Weighted impurity M1                     Weighted impurity M2

Gain = P - M1 vs. P - M2
Measure of Impurity: GINI

• Gini Index for a given node t:

  GINI(t) = 1 - \sum_j [p(j|t)]^2

  (NOTE: p(j|t) is the relative frequency of class j at node t)

  Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information
  Minimum (0.0) when all records belong to one class, implying most interesting information

  C1: 0, C2: 6  -> Gini = 0.000
  C1: 1, C2: 5  -> Gini = 0.278
  C1: 2, C2: 4  -> Gini = 0.444
  C1: 3, C2: 3  -> Gini = 0.500
Computing Gini Index of a Single Node
∑−= tjptGINI 2)]|([1)(
C1 0
j
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1C1 0 C2 6
P(C1) 0/6 0 P(C2) 6/6 1
Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0
C1 1 C2 5
P(C1) = 1/6 P(C2) = 5/6C2 5
Gini = 1 – (1/6)2 – (5/6)2 = 0.278
C1 2 C2 4
P(C1) = 2/6 P(C2) = 4/6
Gini = 1 – (2/6)2 – (4/6)2 = 0.444
35
Computing Gini Index for a Collection of Nodes

• When a node p is split into k partitions (children):

  GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)

  where n_i = number of records at child i, and n = number of records at parent node p

• Choose the attribute that minimizes the weighted average Gini index of the children
• The Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT
Binary Attributes: Computing GINI Index

• Splits into two partitions
• Effect of weighing partitions: larger and purer partitions are sought

  Parent: C1 = 6, C2 = 6, Gini = 0.500

  Split on B?
    Yes -> Node N1: C1 = 5, C2 = 1
    No  -> Node N2: C1 = 2, C2 = 4

  Gini(N1) = 1 - (5/6)^2 - (1/6)^2 = 0.278
  Gini(N2) = 1 - (2/6)^2 - (4/6)^2 = 0.444
  Gini(Children) = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
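A short sketch reproducing the numbers above (the helper names are mine):

  def gini(counts):
      n = sum(counts)
      return 1 - sum((c / n) ** 2 for c in counts)

  def gini_split(children):
      """Weighted average Gini of the children of a split.
      `children` is a list of class-count lists, one per partition."""
      n = sum(sum(child) for child in children)
      return sum(sum(child) / n * gini(child) for child in children)

  # The binary split B? from the slide: N1 = (C1: 5, C2: 1), N2 = (C1: 2, C2: 4)
  print(gini([6, 6]))              # parent: 0.500
  print(gini_split([[5, 1], [2, 4]]))   # children: ~0.361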
Categorical Attributes: Computing Gini Index

• For each distinct value, gather the counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split:

  CarType   Family  Sports  Luxury
  C1        1       8       1
  C2        3       0       7
  Gini 0.163

Two-way split (find the best partition of values):

  CarType   {Sports, Luxury}  {Family}        CarType   {Sports}  {Family, Luxury}
  C1        9                 1               C1        8         2
  C2        7                 3               C2        0         10
  Gini 0.468                                  Gini 0.167
Continuous Attributes: Computing Gini Index

• Use binary decisions based on one value (e.g., Annual Income in the loan table)
• Several choices for the splitting value
  Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
  Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v
  For each v, scan the database to gather the count matrix and compute its Gini index
  Computationally inefficient! Repetition of work.
Continuous Attributes: Computing Gini Index...

• For efficient computation: for each attribute,
  Sort the attribute on its values
  Linearly scan these values, each time updating the count matrix and computing the Gini index
  Choose the split position that has the least Gini index

  Sorted values (Annual Income):  60   70   75   85   90   95   100  120  125  220
  Cheat:                          No   No   No   Yes  Yes  Yes  No   No   No   No

  Split position v:   55     65     72     80     87     92     97     110    122    172    230
  Yes  (<=v, >v):     0,3    0,3    0,3    0,3    1,2    2,1    3,0    3,0    3,0    3,0    3,0
  No   (<=v, >v):     0,7    1,6    2,5    3,4    3,4    3,4    3,4    4,3    5,2    6,1    7,0
  Gini:               0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420
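A sketch of the sorted linear scan described above (my own function and variable names; split positions are taken as midpoints between consecutive sorted values, whereas the table uses nearby round numbers):

  def gini(counts):
      n = sum(counts)
      return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

  def best_split_sorted_scan(values, labels, classes=("Yes", "No")):
      """Sort once, then sweep split positions left to right, updating the
      class counts on each side incrementally instead of rescanning the data."""
      pairs = sorted(zip(values, labels))
      n = len(pairs)
      left = {c: 0 for c in classes}                   # counts for A <= v
      right = {c: labels.count(c) for c in classes}    # counts for A > v
      best_v, best_g = None, float("inf")
      for i in range(n - 1):
          val, lab = pairs[i]
          left[lab] += 1
          right[lab] -= 1
          v = (val + pairs[i + 1][0]) / 2              # midpoint split position
          g = ((i + 1) / n) * gini(list(left.values())) + ((n - i - 1) / n) * gini(list(right.values()))
          if g < best_g:
              best_v, best_g = v, g
      return best_v, best_g

  income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
  cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
  print(best_split_sorted_scan(income, cheat))   # roughly (97.5, 0.30); the table's best cut is v = 97 with Gini 0.300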
Measure of Impurity: Entropy

• Entropy at a given node t:

  Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

  (NOTE: p(j|t) is the relative frequency of class j at node t)

  Maximum (log nc) when records are equally distributed among all classes, implying least information
  Minimum (0.0) when all records belong to one class, implying most information

  Entropy-based computations are quite similar to the GINI index computations
Computing Entropy of a Single Node

  Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

  C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Entropy = -0 log 0 - 1 log 1 = -0 - 0 = 0

  C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
                 Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

  C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
                 Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
Computing Information Gain After Splitting

• Information Gain:

  GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)

  Parent node p is split into k partitions; n_i is the number of records in partition i

  Choose the split that achieves the most reduction (maximizes GAIN)

  Used in the ID3 and C4.5 decision tree algorithms
Problems with Information Gain

• Information gain tends to prefer splits that result in a large number of partitions, each being small but pure

  Customer ID has the highest information gain because the entropy of all of its children is zero
Gain Ratio

• Gain Ratio:

  GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}, \quad SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log \frac{n_i}{n}

  Parent node p is split into k partitions; n_i is the number of records in partition i

  Adjusts information gain by the entropy of the partitioning (SplitINFO)
  Higher-entropy partitioning (a large number of small partitions) is penalized!
  Used in the C4.5 algorithm
  Designed to overcome the disadvantage of information gain
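A short sketch computing both quantities (the function names and the 3-way split counts in the example are hypothetical):

  from math import log2

  def entropy(counts):
      n = sum(counts)
      return -sum(c / n * log2(c / n) for c in counts if c)

  def gain_and_gain_ratio(parent_counts, children_counts):
      """Information gain and gain ratio of a split.
      `children_counts` is a list of class-count lists, one per partition."""
      n = sum(parent_counts)
      weights = [sum(child) / n for child in children_counts]
      gain = entropy(parent_counts) - sum(w * entropy(child)
                                          for w, child in zip(weights, children_counts))
      split_info = -sum(w * log2(w) for w in weights if w)
      return gain, gain / split_info if split_info else 0.0

  # Hypothetical 3-way split of a node with 10 "Yes" and 10 "No" records
  print(gain_and_gain_ratio([10, 10], [[8, 2], [1, 6], [1, 2]]))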
Measure of Impurity: Classification Error

• Classification error at a node t:

  Error(t) = 1 - \max_i P(i|t)

  Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information
  Minimum (0) when all records belong to one class, implying most interesting information
Computing Error of a Single Node

  Error(t) = 1 - \max_i P(i|t)

  C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Error = 1 - max(0, 1) = 1 - 1 = 0

  C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
                 Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

  C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
                 Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
Comparison among Impurity Measures

For a 2-class problem (the horizontal axis of the figure is the fraction of records that belong to one of the two classes):

The different impurity measures are consistent with one another. However, the attribute chosen as the test condition can still differ depending on which impurity measure is used.
Misclassification Error vs Gini Index

  Parent: C1 = 7, C2 = 3, Gini = 0.42

  Split on A?
    Yes -> Node N1: C1 = 3, C2 = 0
    No  -> Node N2: C1 = 4, C2 = 3

  Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
  Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
  Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342

  Gini improves, but the misclassification error remains the same!
Tree Induction

• Greedy strategy
  Split the records based on an attribute test that optimizes a certain criterion

• Issues
  Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
  Determine when to stop splitting
Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values
• Early termination (to be discussed later)
Decision Tree Based Classification

• Advantages:
  Inexpensive to construct
  Extremely fast at classifying unknown records
  Easy to interpret for small-sized trees
  Accuracy is comparable to other classification techniques for many simple data sets
Example: C4.5

• Simple depth-first construction
• Uses information gain
• Sorts continuous attributes at each node
• Needs the entire data to fit in memory
• Unsuitable for large datasets
  Needs out-of-core sorting
• You can download the software from:
  http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
Classification: Model Overfitting and Classifier Evaluation
Classification Errors

• Training errors (apparent errors)
  Errors committed on the training set
• Test errors
  Errors committed on the test set
• Generalization errors
  Expected error of a model over a random selection of records from the same distribution (the expected error on previously unseen records)
Example Data Set

Two-class problem: +, o

3000 data points (30% for training, 70% for testing)

The data for the + class is generated from a uniform distribution

The data for the o class is generated from a mixture of 3 Gaussian distributions, centered at (5,15), (10,5), and (15,15)
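A sketch of how such a data set could be generated; the unit-variance isotropic Gaussians, the uniform range, and the equal class sizes are my own assumptions, since the slide does not state them:

  import numpy as np

  rng = np.random.default_rng(0)
  n_per_class = 1500                      # 3000 points total, split evenly (assumed)

  # "+" class: uniform over a square covering the region (range assumed)
  plus = rng.uniform(low=0, high=20, size=(n_per_class, 2))

  # "o" class: mixture of 3 Gaussians centered at (5,15), (10,5), (15,15)
  centers = np.array([[5, 15], [10, 5], [15, 15]])
  component = rng.integers(0, 3, size=n_per_class)
  circle = centers[component] + rng.normal(scale=1.0, size=(n_per_class, 2))

  X = np.vstack([plus, circle])
  y = np.array(["+"] * n_per_class + ["o"] * n_per_class)

  # 30% for training, 70% for testing
  idx = rng.permutation(len(X))
  n_train = int(0.3 * len(X))
  X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
  X_test, y_test = X[idx[n_train:]], y[idx[n_train:]]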
Decision Trees

Decision tree with 11 leaf nodes vs. decision tree with 24 leaf nodes

Which tree is better?
Model Overfitting

Underfitting: when the model is too simple, both the training and the test errors are large

Overfitting: when the model is too complex, the training error is small but the test error is large
Overfitting due to Noise

The decision boundary is distorted by the noise points
Overfitting due to Insufficient Examples

The lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly

- The insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
Mammal Classification Problem

Training Set -> Decision Tree Model

Training error = 0%
Effect of Noise (the data are wrong)

Example: the mammal classification problem

Model M1: training error = 0%, test error = 30%
Model M2: training error = 20%, test error = 10%

One of the models:

  Body Temperature?
    Warm-blooded -> Give Birth?
                      Yes -> Mammals
                      No  -> Non-mammals
    Cold-blooded -> Non-mammals
Lack of Representative Samples

Training Set / Test Set

Model M3: training error = 0%, test error = 30%

A lack of training records at the leaf nodes prevents a reliable classification
Effect of Multiple Comparison Procedure

• Consider the task of predicting whether the stock market will rise or fall in the next 10 trading days

  Day 1   Up
  Day 2   Down
  Day 3   Down
  Day 4   Up
  Day 5   Down
  Day 6   Down
  Day 7   Up
  Day 8   Up
  Day 9   Up
  Day 10  Down

• Random guessing: P(correct) = 0.5

• Make 10 random guesses in a row:

  P(\#correct \ge 8) = \frac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0547
Effect of Multiple Comparison Procedure

• Approach:
  Get 50 analysts
  Each analyst makes 10 random guesses
  Choose the analyst that makes the most correct predictions

• Probability that at least one analyst makes at least 8 correct predictions:

  P(\#correct \ge 8) = 1 - (1 - 0.0547)^{50} = 0.9399
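A quick numeric check of the two probabilities above (variable names are mine):

  from math import comb

  # Probability that a single analyst gets at least 8 of 10 coin-flip guesses right
  p_single = sum(comb(10, k) for k in (8, 9, 10)) / 2 ** 10
  print(round(p_single, 4))          # 0.0547

  # Probability that at least one of 50 independent analysts does so
  p_any_of_50 = 1 - (1 - p_single) ** 50
  print(round(p_any_of_50, 4))       # 0.9399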
Effect of Multiple Comparison Procedure

• Many algorithms employ the following greedy strategy:
  Initial model: M
  Alternative model: M' = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
  Keep M' if the improvement Δ(M, M') > α

• Often, γ is chosen from a set of alternative components Γ = {γ1, γ2, …, γk}

• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
Notes on Overfitting

• Overfitting results in decision trees that are more complex than necessary

• The training error no longer provides a good estimate of how well the tree will perform on previously unseen records

• We need new ways of estimating generalization errors
Incorporating Model Complexity

• Rationale: Occam's Razor
  Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
  A complex model has a greater chance of being fitted accidentally by errors in the data
  Therefore, one should include model complexity when evaluating a model
Minimum Description Length (MDL)

(Figure: records X1 … Xn with known labels y on one side, the same records with unknown labels on the other, and candidate decision trees with test nodes A?, B?, C? and leaves 0/1.)

• Cost(Model, Data) = Cost(Data | Model) + Cost(Model)
  Cost is the number of bits needed for encoding
  Search for the least costly model
• Cost(Data | Model) encodes the misclassification errors
• Cost(Model) uses node encoding (number of children) plus splitting-condition encoding
Estimating Statistical Bounds

• Estimate the generalization error with a statistical correction of the training error; because the generalization error is usually larger, the correction typically computes an upper bound on the training error

• Notation:
  e            error rate at the node, e = k / N
  x            true error rate
  N            total number of training records at the node
  k            number of misclassified records
  \alpha       confidence level
  z_{\alpha/2} standardized value from the standard normal distribution
Estimating Statistical Bounds

• With N training records, the probability that k of them are misclassified follows a binomial distribution:

  p(k, N) = \binom{N}{k} x^k (1 - x)^{N - k}

• The binomial distribution can be approximated by a normal distribution with

  \mu = N x, \quad \sigma^2 = N x (1 - x), \quad \text{i.e.} \quad k \sim N(Nx,\ Nx(1 - x))
Estimating Statistical Bounds

• Standardizing the normal approximation of the binomial distribution:

  \frac{k - Nx}{\sqrt{Nx(1 - x)}} \sim N(0, 1)

  so, at confidence level \alpha,

  \frac{k - Nx}{\sqrt{Nx(1 - x)}} \le z_{\alpha/2}

  and, since k = Ne,

  Ne - Nx \le z_{\alpha/2} \sqrt{Nx(1 - x)}
Estimating Statistical Bounds

• Squaring both sides of Ne - Nx \le z_{\alpha/2} \sqrt{Nx(1 - x)} gives a quadratic inequality in the true error rate x:

  (N + z_{\alpha/2}^2)\, x^2 - (2Ne + z_{\alpha/2}^2)\, x + Ne^2 \le 0

• Solving the quadratic for its larger root gives the upper bound:

  e_{upper}(N, e, \alpha) = \frac{ e + \frac{z_{\alpha/2}^2}{2N} + z_{\alpha/2} \sqrt{ \frac{e(1-e)}{N} + \frac{z_{\alpha/2}^2}{4N^2} } }{ 1 + \frac{z_{\alpha/2}^2}{N} }
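A small numeric sketch of the bound (scipy's norm.ppf is used to obtain z_{\alpha/2}; the leaf in the example, with 7 of 30 records misclassified, is hypothetical):

  from math import sqrt
  from scipy.stats import norm

  def e_upper(N, e, alpha=0.25):
      """Upper bound on the true error rate given training error e over N records,
      using the normal approximation derived above."""
      z = norm.ppf(1 - alpha / 2)          # z_{alpha/2}
      num = e + z**2 / (2 * N) + z * sqrt(e * (1 - e) / N + z**2 / (4 * N**2))
      return num / (1 + z**2 / N)

  # Hypothetical leaf: 7 of 30 training records misclassified (e = 7/30)
  print(e_upper(N=30, e=7/30, alpha=0.25))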
Using a Validation Set

• Divide the training data into two parts:
  Training set: use for model building
  Validation set: use for estimating the generalization error
  Note: the validation set is not the same as the test set

• Drawback:
  Less data is available for training
Handling Overfitting in Decision Trees

• Pre-Pruning (early stopping rule)
  Stop the algorithm before it becomes a fully-grown tree
  Typical stopping conditions for a node:
    Stop if all instances belong to the same class
    Stop if all the attribute values are the same
  More restrictive conditions:
    Stop if the number of instances is less than some user-specified threshold
    Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test)
    Stop if expanding the current node does not improve the impurity measures (e.g., Gini or information gain)
    Stop if the estimated generalization error falls below a certain threshold
Handling Overfitting in Decision Trees

• Post-pruning
  Grow the decision tree to its entirety
  Subtree replacement
    Trim the nodes of the decision tree in a bottom-up fashion
    If the generalization error improves after trimming, replace the subtree with a leaf node
    The class label of the leaf node is determined from the majority class of the instances in the subtree
  Subtree raising
    Replace a subtree with its most frequently used branch
Example of Post-Pruning
Class = Yes 20
Training Error (Before splitting) = 10/30
Pessimistic error = (10 + 0 5)/30 = 10 5/30Class = Yes 20
Class = No 10
Error = 10/30
Pessimistic error (10 + 0.5)/30 10.5/30
Training Error (After splitting) = 9/30
Pessimistic error (After splitting)
A?
Error = 10/30 ess st c e o ( te sp tt g)
= (9 + 4 × 0.5)/30 = 11/30
PRUNE!A?
A1
A2 A3
A4
Class = Yes 8Class = No 4
Class = Yes 3Class = No 4
Class = Yes 4Class = No 1
Class = Yes 5Class = No 1Class No 4 Class No 4 Class No 1 Class No 1
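A quick sketch reproducing the pruning decision above, assuming (as on the slide) a penalty of 0.5 per leaf; the helper name is mine:

  def pessimistic_error(leaf_counts, penalty=0.5):
      """Pessimistic error of a (sub)tree: training errors plus a penalty per leaf,
      divided by the total number of records."""
      n = sum(sum(c) for c in leaf_counts)
      errors = sum(sum(c) - max(c) for c in leaf_counts)
      return (errors + penalty * len(leaf_counts)) / n

  before = [[20, 10]]                          # unsplit node: Yes = 20, No = 10
  after = [[8, 4], [3, 4], [4, 1], [5, 1]]     # children A1..A4
  print(pessimistic_error(before))             # 10.5/30 = 0.35
  print(pessimistic_error(after))              # 11/30 ~= 0.367 -> prune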
Examples of Post-pruning
Handling Missing Attribute Values

• Missing values affect decision tree construction in three different ways:
  They affect how the impurity measures are computed
  They affect how instances with missing values are distributed to the child nodes
  They affect how a test instance with a missing value is classified
Computing the Impurity Measure (ignoring the record with the missing value)

  Tid  Refund  Marital Status  Taxable Income  Class
   1   Yes     Single          125K            No
   2   No      Married         100K            No
   3   No      Single          70K             No
   4   Yes     Married         120K            No
   5   No      Divorced        95K             Yes
   6   No      Married         60K             No
   7   Yes     Divorced        220K            No
   8   No      Single          85K             Yes
   9   No      Married         75K             No
  10   ?       Single          90K             Yes      <- missing value

Before splitting: Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813

  Counts          Class = Yes  Class = No
  Refund = Yes    0            3
  Refund = No     2            4
  Refund = ?      1            0

Split on Refund (ignoring the record with the missing value):

  Entropy(Refund = Yes) = 0
  Entropy(Refund = No)  = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183
  Entropy(Children)     = 0.3 (0) + 0.6 (0.9183) = 0.551

  Gain = 0.9 × (0.8813 - 0.551) ≈ 0.297, where 0.9 is the fraction of records with a known Refund value (the unscaled difference is 0.3303)
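A quick numeric check of these entropy values (helper and variable names are mine, not from the course):

  from math import log2

  def entropy(counts):
      n = sum(counts)
      return -sum(c / n * log2(c / n) for c in counts if c)

  parent = entropy([3, 7])                       # all 10 records: 3 Yes, 7 No -> ~0.8813

  # Split on Refund, ignoring the record whose Refund value is missing:
  # Refund=Yes -> (0 Yes, 3 No), Refund=No -> (2 Yes, 4 No), weighted 3/10 and 6/10 as on the slide
  children = 0.3 * entropy([0, 3]) + 0.6 * entropy([2, 4])   # ~0.551
  print(round(parent, 4), round(children, 4))
  print(round(parent - children, 4))             # unscaled difference, ~0.3303
  print(round(0.9 * (parent - children), 4))     # scaled by the fraction of known Refund values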
Distribute Instances (send the missing-value record to both children, according to the observed distribution)

  Records 1-9 (Refund known) are sent down the Refund split as usual:

    Refund = Yes:  Class = Yes: 0, Class = No: 3
    Refund = No:   Class = Yes: 2, Class = No: 4

  Record 10 (Refund = ?, Single, 90K, Class = Yes) is distributed to both children:

    Probability that Refund = Yes is 3/9
    Probability that Refund = No is 6/9

    Assign the record to the left child (Refund = Yes) with weight 3/9 and to the right child (Refund = No) with weight 6/9:

    Refund = Yes:  Class = Yes: 0 + 3/9, Class = No: 3
    Refund = No:   Class = Yes: 2 + 6/9, Class = No: 4
Classify Instances

New record:  Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?

  Tree: Refund? (Yes -> NO; No -> MarSt?); MarSt? (Married -> NO; Single, Divorced -> TaxInc?); TaxInc? (< 80K -> NO; > 80K -> YES)

  Fractional counts at the Marital Status node:

                 Married  Single  Divorced  Total
    Class = No   3        1       0         4
    Class = Yes  6/9      1       1         2.67
    Total        3.67     2       1         6.67

  Probability that Marital Status = Married is 3.67/6.67
  Probability that Marital Status = {Single, Divorced} is 3/6.67
Other Issues

• Data Fragmentation
• Search Strategy
• Expressiveness
• Tree Replication
Data Fragmentation

• The number of instances gets smaller as you traverse down the tree

• The number of instances at the leaf nodes could be too small to make any statistically significant decision
Search Strategy

• Finding an optimal decision tree is NP-hard

• The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution

• Other strategies?
  Bottom-up
  Bi-directional
Expressiveness

• Decision trees provide an expressive representation for learning discrete-valued functions
  But they do not generalize well to certain types of Boolean functions
  Example: the parity function
    - Class = 1 if there is an even number of Boolean attributes with truth value = True
    - Class = 0 if there is an odd number of Boolean attributes with truth value = True
  For accurate modeling, the tree must be a complete tree

• Not expressive enough for modeling continuous variables
  Particularly when the test condition involves only a single attribute at a time
Decision Boundary

• The border line between two neighboring regions of different classes is known as the decision boundary

• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
Oblique Decision Trees

  Example test condition: x + y < 1 (one class on each side of the line)

• The test condition may involve multiple attributes
• More expressive representation
• Finding the optimal test condition is computationally expensive
Tree Replication

(Figure: a decision tree with root P in which the same subtree, containing the tests Q and S, appears in more than one branch.)

• The same subtree can appear in multiple branches