Classification: Basic Concepts, Decision Trees, and Model Evaluation

Qi Liu (刘淇)
School of Computer Science and Technology, USTC
http://staff.ustc.edu.cn/~qiliuql/DM2013.html
Classification: Definition

• Given a collection of records (training set)
  Each record is characterized by a tuple (x, y), where x is the attribute set and y is the class label
    x: attribute, predictor, independent variable, input
    y: class, response, dependent variable, output

• Task: learn a model that maps each attribute set x into one of the predefined class labels y

Example training set:

  ID  Home Owner  Marital Status  Annual Income  Defaulted Borrower
   1  Yes         Single          125K           No
   2  No          Married         100K           No
   3  No          Single          70K            No
   4  Yes         Married         120K           No
   5  No          Divorced        95K            Yes
   6  No          Married         60K            No
   7  Yes         Divorced        220K           No
   8  No          Single          85K            Yes
   9  No          Married         75K            No
  10  No          Single          90K            Yes
Examples of Classification Task

  Task                                 Attribute set, x                          Class label, y
  Decide to mail a catalog or not      Demographic information for households    Purchase or no purchase
  Customer churn prediction            Usage data for phone users                Churn or non-churn
  Decide to issue a credit card or not Application data                          Good credit or bad credit
  Categorizing email messages
  or web pages                         Words in the document                     Spam or non-spam
General Approach for Building Classification Model

Model Training:

  Tid  Attrib1  Attrib2  Attrib3  Class
   1   Yes      Large    125K     No
   2   No       Medium   100K     No
   3   No       Small    70K      No
   4   Yes      Medium   120K     No
   5   No       Large    95K      Yes
   6   No       Medium   60K      No
   7   Yes      Large    220K     No
   8   No       Small    85K      Yes
   9   No       Medium   75K      No
  10   No       Small    90K      Yes

  Training Set -> Learn Model
General Approach for Building Classification Model

Model Testing:

  Training set (Tid 1-10, as above) -> Learn Model

  Test set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small    55K      ?
  12   Yes      Medium   80K      ?
  13   Yes      Large    110K     ?
  14   No       Small    95K      ?
  15   No       Large    67K      ?

  Apply Model to the test set
Performance Evaluation

Confusion matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes    a (TP)      b (FN)
  CLASS    Class=No     c (FP)      d (TN)

Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)
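A minimal sketch of the accuracy metric above (not from the slides; the function name and the example counts are hypothetical):

  def accuracy(tp, fn, fp, tn):
      """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
      return (tp + tn) / (tp + tn + fp + fn)

  # Hypothetical counts: 50 TP, 10 FN, 5 FP, 35 TN -> accuracy = 85/100 = 0.85
  print(accuracy(tp=50, fn=10, fp=5, tn=35))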
Classification Techniques

• Base Classifiers
  Decision Tree based Methods
  Rule-based Methods
  Nearest-neighbor
  Neural Networks
  Naïve Bayes and Bayesian Belief Networks
  Support Vector Machines

• Ensemble Classifiers
  Boosting, Bagging, Random Forests
Decision Trees

• Examples and Introduction
• Usage of Decision Tree
• Decision Tree Induction
• ……
Example of a Decision Tree Built

Training data: the 10-record loan table above (ID, Home Owner, Marital Status, Annual Income, Defaulted Borrower).

Model: decision tree (the splitting attributes appear at the internal nodes):

  Home Owner?
    Yes -> NO
    No  -> Marital Status?
             Married          -> NO
             Single, Divorced -> Annual Income?
                                   < 80K  -> NO
                                   >= 80K -> YES
Concepts of The Tree Structure

The same tree, viewed as a structure of a root node, internal nodes, and leaf nodes:

  Home Owner?
    Yes -> NO
    No  -> Marital Status?
             Married          -> NO
             Single, Divorced -> Annual Income?
                                   < 80K  -> NO
                                   >= 80K -> YES
Another Example of Decision Tree Built

Training data: the same 10-record loan table.

  Marital Status?
    Married          -> NO
    Single, Divorced -> Home Owner?
                          Yes -> NO
                          No  -> Annual Income?
                                   < 80K  -> NO
                                   >= 80K -> YES

There can be more than one tree that fits the same data!
Using Decision Tree for Classification

  Training set (Tid 1-10)             -> Learn Model  -> Decision Tree
  Test set (Tid 11-15, class unknown) -> Apply Model (the decision tree) to assign class labels
Applying Model to Test Data

Test record:

  Home Owner  Marital Status  Annual Income  Defaulted Borrower
  No          Married         80K            ?

Start from the root of the tree and, at each test condition, follow the branch that matches the record:

  Home Owner?      record has Home Owner = No          -> follow the "No" branch to Marital Status?
  Marital Status?  record has Marital Status = Married -> follow the "Married" branch to a leaf labeled NO

The record reaches the leaf NO, so assign Defaulted Borrower = "No" to the test record.
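A small sketch (not from the slides) of how this traversal can be coded; the dictionary-based tree encoding, the classify function, and the income threshold key are my own illustration:

  # Hand-coded version of the example tree: internal nodes test one attribute,
  # leaves carry a class label for "Defaulted Borrower".
  tree = ("Home Owner", {
      "Yes": "No",
      "No": ("Marital Status", {
          "Married": "No",
          "Single": ("Annual Income < 80K", {True: "No", False: "Yes"}),
          "Divorced": ("Annual Income < 80K", {True: "No", False: "Yes"}),
      }),
  })

  def classify(record, node):
      """Walk the tree from the root until a leaf (a plain string) is reached."""
      while not isinstance(node, str):
          attribute, branches = node
          if attribute == "Annual Income < 80K":
              key = record["Annual Income"] < 80
          else:
              key = record[attribute]
          node = branches[key]
      return node

  test_record = {"Home Owner": "No", "Marital Status": "Married", "Annual Income": 80}
  print(classify(test_record, tree))   # -> "No"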
How to Build a Decision Tree?

  Training set (Tid 1-10)  -> Learn Model -> Decision Tree
  Test set (Tid 11-15)     -> Apply Model
Decision Tree Induction

• How to build a decision tree from a data table
• Famous algorithms:
  Hunt's Algorithm (one of the earliest)
  CART
  ID3, C4.5
  SLIQ, SPRINT
General Structure of Hunt's Algorithm

Let Dt be the set of training records that reach a node t (training data: the 10-record loan table).

• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
• If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets.
• Recursively apply the procedure to each subset.
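A compact sketch of the recursive procedure just described, under simplifying assumptions of my own (categorical attributes only, a majority-class leaf when no attribute is left, and a placeholder choose_best_attribute; the real selection criteria, Gini and information gain, appear later in the slides):

  from collections import Counter

  def hunt(records, attributes, target="class"):
      """Recursive skeleton of Hunt's algorithm (simplified: categorical splits only)."""
      labels = [r[target] for r in records]
      # Case 1: all records belong to the same class -> leaf node
      if len(set(labels)) == 1:
          return labels[0]
      # Degenerate case: no attribute left to test -> majority-class leaf
      if not attributes:
          return Counter(labels).most_common(1)[0][0]
      # Case 2: mixed classes -> pick an attribute test and split
      attr = choose_best_attribute(records, attributes, target)
      children = {}
      for value in set(r[attr] for r in records):
          subset = [r for r in records if r[attr] == value]
          children[value] = hunt(subset, [a for a in attributes if a != attr], target)
      return (attr, children)

  def choose_best_attribute(records, attributes, target):
      # Placeholder: simply take the first candidate attribute.
      return attributes[0]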
Hunt's Algorithm

Applied to the 10-record loan table. (Figure: the tree is grown step by step, starting from a single node and repeatedly splitting the impure nodes until the final tree is obtained.)
Design Issues of Decision Tree Induction

• How should training records be split?
  Method for specifying the test condition, depending on attribute types
  Measure for evaluating the goodness of a test condition

• How should the splitting procedure stop?
  Stop splitting if all the records belong to the same class or all the records have identical attribute values
  Early termination
Methods for Expressing Test Conditions

• Depends on attribute types
  Binary
  Nominal
  Ordinal
  Continuous

• Depends on the number of ways to split
  2-way split
  Multi-way split
Test Condition for Nominal Attributes

• Multi-way split: use as many partitions as distinct values
• Binary split: divides the values into two subsets at a time; need to find the optimal partitioning
Test Condition for Ordinal Attributes

• Multi-way split: use as many partitions as distinct values
• Binary split: divides the values into two subsets; need to find the optimal partitioning
  Must preserve the order property among attribute values (the figure shows a grouping that violates the order property)
Test Condition for Continuous Attributes

(Figure: a binary split of the form A < v vs. a multi-way split into value ranges.)
Splitting Based on Continuous Attributes

• Different ways of handling
  Discretization to form an ordinal categorical attribute
    Static – discretize once at the beginning
    Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  Binary decision: (A < v) or (A ≥ v)
    Consider all possible splits and find the best cut
    Can be more compute-intensive
How to determine the Best Split

Before splitting: 10 records of class 0, 10 records of class 1

Which test condition is the best?
How to determine the Best Split

• Greedy approach: nodes with a purer class distribution are preferred
• Need a measure of node impurity (high degree of impurity vs. low degree of impurity)
Measures of Node Impurity

• Gini Index
  GINI(t) = 1 - \sum_j [p(j|t)]^2

• Entropy
  Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

• Misclassification error
  Error(t) = 1 - \max_i P(i|t)
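A small sketch (function name is mine) that computes all three measures from a node's class counts; it reproduces the single-node examples used in the following slides:

  from math import log2

  def impurities(counts):
      """Gini, entropy and classification error for one node, given class counts,
      e.g. counts = [3, 3] means 3 records of class C1 and 3 of class C2."""
      n = sum(counts)
      probs = [c / n for c in counts]
      gini = 1 - sum(p ** 2 for p in probs)
      entropy = -sum(p * log2(p) for p in probs if p > 0)
      error = 1 - max(probs)
      return gini, entropy, error

  print(impurities([0, 6]))   # (0.0, 0.0, 0.0)
  print(impurities([1, 5]))   # (~0.278, ~0.65, ~0.167)
  print(impurities([2, 4]))   # (~0.444, ~0.92, ~0.333)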
Finding the Best Split
1. Compute impurity measure (P) before splitting2 Compute impurity measure (M) after splitting2. Compute impurity measure (M) after splitting
Compute impurity measure of each child nodeCompute the average impurity of the children (M)Compute the average impurity of the children (M)
3. Choose the attribute test condition that produces the highest gainthe highest gain
Gain = P – M
or equivalently, lowest impurity measure after splitting (M)splitting (M)
32
Finding the Best Split

Before splitting: class counts (C0: N00, C1: N01)  ->  impurity P

  Split on A?                              Split on B?
    Yes -> Node N1 (C0: N10, C1: N11)        Yes -> Node N3 (C0: N30, C1: N31)
    No  -> Node N2 (C0: N20, C1: N21)        No  -> Node N4 (C0: N40, C1: N41)
  Child impurities M11, M12                Child impurities M21, M22
  Weighted impurity M1                     Weighted impurity M2

Gain = P - M1 vs. P - M2
Measure of Impurity: GINI

• Gini Index for a given node t:

  GINI(t) = 1 - \sum_j [p(j|t)]^2

  (NOTE: p(j|t) is the relative frequency of class j at node t)

  Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information
  Minimum (0.0) when all records belong to one class, implying most interesting information

  C1: 0, C2: 6  -> Gini = 0.000
  C1: 1, C2: 5  -> Gini = 0.278
  C1: 2, C2: 4  -> Gini = 0.444
  C1: 3, C2: 3  -> Gini = 0.500
Computing Gini Index of a Single Node
∑−= tjptGINI 2)]|([1)(
C1 0
j
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1C1 0 C2 6
P(C1) 0/6 0 P(C2) 6/6 1
Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0
C1 1 C2 5
P(C1) = 1/6 P(C2) = 5/6C2 5
Gini = 1 – (1/6)2 – (5/6)2 = 0.278
C1 2 C2 4
P(C1) = 2/6 P(C2) = 4/6
Gini = 1 – (2/6)2 – (4/6)2 = 0.444
35
Computing Gini Index for a Collection of Nodes

• When a node p is split into k partitions (children):

  GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)

  where n_i = number of records at child i, and n = number of records at parent node p

• Choose the attribute that minimizes the weighted average Gini index of the children
• The Gini index is used in decision tree algorithms such as CART, SLIQ, SPRINT
Binary Attributes: Computing GINI Index

• Splits into two partitions
• Effect of weighing partitions: larger and purer partitions are sought

  Parent: C1 = 6, C2 = 6, Gini = 0.500

  Split on B?
    Yes -> Node N1: C1 = 5, C2 = 1
    No  -> Node N2: C1 = 2, C2 = 4

  Gini(N1) = 1 - (5/6)^2 - (1/6)^2 = 0.278
  Gini(N2) = 1 - (2/6)^2 - (4/6)^2 = 0.444
  Gini(Children) = 6/12 * 0.278 + 6/12 * 0.444 = 0.361
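A short sketch reproducing the numbers above (the helper names are mine):

  def gini(counts):
      n = sum(counts)
      return 1 - sum((c / n) ** 2 for c in counts)

  def gini_split(children):
      """Weighted average Gini of the children of a split.
      `children` is a list of class-count lists, one per partition."""
      n = sum(sum(child) for child in children)
      return sum(sum(child) / n * gini(child) for child in children)

  # The binary split B? from the slide: N1 = (C1: 5, C2: 1), N2 = (C1: 2, C2: 4)
  print(gini([6, 6]))              # parent: 0.500
  print(gini_split([[5, 1], [2, 4]]))   # children: ~0.361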
Categorical Attributes: Computing Gini Index

• For each distinct value, gather the counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split:

  CarType   Family  Sports  Luxury
  C1        1       8       1
  C2        3       0       7
  Gini 0.163

Two-way split (find the best partition of values):

  CarType   {Sports, Luxury}  {Family}        CarType   {Sports}  {Family, Luxury}
  C1        9                 1               C1        8         2
  C2        7                 3               C2        0         10
  Gini 0.468                                  Gini 0.167
Continuous Attributes: Computing Gini Index

• Use binary decisions based on one value (e.g., Annual Income in the loan table)
• Several choices for the splitting value
  Number of possible splitting values = number of distinct values
• Each splitting value v has a count matrix associated with it
  Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose the best v
  For each v, scan the database to gather the count matrix and compute its Gini index
  Computationally inefficient! Repetition of work.
Continuous Attributes: Computing Gini Index...

• For efficient computation: for each attribute,
  Sort the attribute on its values
  Linearly scan these values, each time updating the count matrix and computing the Gini index
  Choose the split position that has the least Gini index

  Sorted values (Annual Income):  60   70   75   85   90   95   100  120  125  220
  Cheat:                          No   No   No   Yes  Yes  Yes  No   No   No   No

  Split position v:   55     65     72     80     87     92     97     110    122    172    230
  Yes  (<=v, >v):     0,3    0,3    0,3    0,3    1,2    2,1    3,0    3,0    3,0    3,0    3,0
  No   (<=v, >v):     0,7    1,6    2,5    3,4    3,4    3,4    3,4    4,3    5,2    6,1    7,0
  Gini:               0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420
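A sketch of the sorted linear scan described above (my own function and variable names; split positions are taken as midpoints between consecutive sorted values, whereas the table uses nearby round numbers):

  def gini(counts):
      n = sum(counts)
      return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

  def best_split_sorted_scan(values, labels, classes=("Yes", "No")):
      """Sort once, then sweep split positions left to right, updating the
      class counts on each side incrementally instead of rescanning the data."""
      pairs = sorted(zip(values, labels))
      n = len(pairs)
      left = {c: 0 for c in classes}                   # counts for A <= v
      right = {c: labels.count(c) for c in classes}    # counts for A > v
      best_v, best_g = None, float("inf")
      for i in range(n - 1):
          val, lab = pairs[i]
          left[lab] += 1
          right[lab] -= 1
          v = (val + pairs[i + 1][0]) / 2              # midpoint split position
          g = ((i + 1) / n) * gini(list(left.values())) + ((n - i - 1) / n) * gini(list(right.values()))
          if g < best_g:
              best_v, best_g = v, g
      return best_v, best_g

  income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
  cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
  print(best_split_sorted_scan(income, cheat))   # roughly (97.5, 0.30); the table's best cut is v = 97 with Gini 0.300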
Measure of Impurity: Entropy

• Entropy at a given node t:

  Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

  (NOTE: p(j|t) is the relative frequency of class j at node t)

  Maximum (log nc) when records are equally distributed among all classes, implying least information
  Minimum (0.0) when all records belong to one class, implying most information

  Entropy-based computations are quite similar to the GINI index computations
Computing Entropy of a Single Node

  Entropy(t) = -\sum_j p(j|t) \log_2 p(j|t)

  C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Entropy = -0 log 0 - 1 log 1 = -0 - 0 = 0

  C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
                 Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

  C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
                 Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
Computing Information Gain After Splitting

• Information Gain:

  GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n} Entropy(i)

  Parent node p is split into k partitions; n_i is the number of records in partition i

  Choose the split that achieves the most reduction (maximizes GAIN)

  Used in the ID3 and C4.5 decision tree algorithms
Problems with Information Gain

• Information gain tends to prefer splits that result in a large number of partitions, each being small but pure

  Customer ID has the highest information gain because the entropy of all of its children is zero
Gain Ratio

• Gain Ratio:

  GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}, \quad SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log \frac{n_i}{n}

  Parent node p is split into k partitions; n_i is the number of records in partition i

  Adjusts information gain by the entropy of the partitioning (SplitINFO)
  Higher-entropy partitioning (a large number of small partitions) is penalized!
  Used in the C4.5 algorithm
  Designed to overcome the disadvantage of information gain
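A short sketch computing both quantities (the function names and the 3-way split counts in the example are hypothetical):

  from math import log2

  def entropy(counts):
      n = sum(counts)
      return -sum(c / n * log2(c / n) for c in counts if c)

  def gain_and_gain_ratio(parent_counts, children_counts):
      """Information gain and gain ratio of a split.
      `children_counts` is a list of class-count lists, one per partition."""
      n = sum(parent_counts)
      weights = [sum(child) / n for child in children_counts]
      gain = entropy(parent_counts) - sum(w * entropy(child)
                                          for w, child in zip(weights, children_counts))
      split_info = -sum(w * log2(w) for w in weights if w)
      return gain, gain / split_info if split_info else 0.0

  # Hypothetical 3-way split of a node with 10 "Yes" and 10 "No" records
  print(gain_and_gain_ratio([10, 10], [[8, 2], [1, 6], [1, 2]]))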
Measure of Impurity: Classification Error

• Classification error at a node t:

  Error(t) = 1 - \max_i P(i|t)

  Maximum (1 - 1/nc) when records are equally distributed among all classes, implying least interesting information
  Minimum (0) when all records belong to one class, implying most interesting information
Computing Error of a Single Node

  Error(t) = 1 - \max_i P(i|t)

  C1: 0, C2: 6   P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                 Error = 1 - max(0, 1) = 1 - 1 = 0

  C1: 1, C2: 5   P(C1) = 1/6,  P(C2) = 5/6
                 Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

  C1: 2, C2: 4   P(C1) = 2/6,  P(C2) = 4/6
                 Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
Comparison among Impurity Measures

For a 2-class problem (the horizontal axis of the figure is the fraction of records that belong to one of the two classes):

The different impurity measures are consistent with one another. However, the attribute chosen as the test condition can still differ depending on which impurity measure is used.
Misclassification Error vs Gini Index

  Parent: C1 = 7, C2 = 3, Gini = 0.42

  Split on A?
    Yes -> Node N1: C1 = 3, C2 = 0
    No  -> Node N2: C1 = 4, C2 = 3

  Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
  Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
  Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342

  Gini improves, but the misclassification error remains the same!
Tree Induction

• Greedy strategy
  Split the records based on an attribute test that optimizes a certain criterion

• Issues
  Determine how to split the records
    How to specify the attribute test condition?
    How to determine the best split?
  Determine when to stop splitting
Stopping Criteria for Tree Induction

• Stop expanding a node when all the records belong to the same class
• Stop expanding a node when all the records have similar attribute values
• Early termination (to be discussed later)
Decision Tree Based Classification

• Advantages:
  Inexpensive to construct
  Extremely fast at classifying unknown records
  Easy to interpret for small-sized trees
  Accuracy is comparable to other classification techniques for many simple data sets
Example: C4.5

• Simple depth-first construction
• Uses information gain
• Sorts continuous attributes at each node
• Needs the entire data to fit in memory
• Unsuitable for large datasets
  Needs out-of-core sorting
• You can download the software from:
  http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
Classification: Model Overfitting and Classifier Evaluation
Classification Errors

• Training errors (apparent errors)
  Errors committed on the training set
• Test errors
  Errors committed on the test set
• Generalization errors
  Expected error of a model over a random selection of records from the same distribution (the expected error on previously unseen records)
Example Data Set

Two-class problem: +, o

3000 data points (30% for training, 70% for testing)

The data for the + class is generated from a uniform distribution

The data for the o class is generated from a mixture of 3 Gaussian distributions, centered at (5,15), (10,5), and (15,15)
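A sketch of how such a data set could be generated; the unit-variance isotropic Gaussians, the uniform range, and the equal class sizes are my own assumptions, since the slide does not state them:

  import numpy as np

  rng = np.random.default_rng(0)
  n_per_class = 1500                      # 3000 points total, split evenly (assumed)

  # "+" class: uniform over a square covering the region (range assumed)
  plus = rng.uniform(low=0, high=20, size=(n_per_class, 2))

  # "o" class: mixture of 3 Gaussians centered at (5,15), (10,5), (15,15)
  centers = np.array([[5, 15], [10, 5], [15, 15]])
  component = rng.integers(0, 3, size=n_per_class)
  circle = centers[component] + rng.normal(scale=1.0, size=(n_per_class, 2))

  X = np.vstack([plus, circle])
  y = np.array(["+"] * n_per_class + ["o"] * n_per_class)

  # 30% for training, 70% for testing
  idx = rng.permutation(len(X))
  n_train = int(0.3 * len(X))
  X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
  X_test, y_test = X[idx[n_train:]], y[idx[n_train:]]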
Decision Trees

Decision tree with 11 leaf nodes vs. decision tree with 24 leaf nodes

Which tree is better?
Model Overfitting

Underfitting: when the model is too simple, both the training and the test errors are large

Overfitting: when the model is too complex, the training error is small but the test error is large
Overfitting due to Noise

The decision boundary is distorted by the noise points
Overfitting due to Insufficient Examples

The lack of data points in the lower half of the diagram makes it difficult to predict the class labels of that region correctly

- The insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
Mammal Classification Problem

Training Set -> Decision Tree Model

Training error = 0%
Effect of Noise (the data are wrong)

Example: the mammal classification problem

Model M1: training error = 0%, test error = 30%
Model M2: training error = 20%, test error = 10%

One of the models:

  Body Temperature?
    Warm-blooded -> Give Birth?
                      Yes -> Mammals
                      No  -> Non-mammals
    Cold-blooded -> Non-mammals
Lack of Representative Samples

Training Set / Test Set

Model M3: training error = 0%, test error = 30%

A lack of training records at the leaf nodes prevents a reliable classification
Effect of Multiple Comparison Procedure

• Consider the task of predicting whether the stock market will rise or fall in the next 10 trading days

  Day 1   Up
  Day 2   Down
  Day 3   Down
  Day 4   Up
  Day 5   Down
  Day 6   Down
  Day 7   Up
  Day 8   Up
  Day 9   Up
  Day 10  Down

• Random guessing: P(correct) = 0.5

• Make 10 random guesses in a row:

  P(\#correct \ge 8) = \frac{\binom{10}{8} + \binom{10}{9} + \binom{10}{10}}{2^{10}} = 0.0547
Effect of Multiple Comparison Procedure

• Approach:
  Get 50 analysts
  Each analyst makes 10 random guesses
  Choose the analyst that makes the most correct predictions

• Probability that at least one analyst makes at least 8 correct predictions:

  P(\#correct \ge 8) = 1 - (1 - 0.0547)^{50} = 0.9399
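A quick numeric check of the two probabilities above (variable names are mine):

  from math import comb

  # Probability that a single analyst gets at least 8 of 10 coin-flip guesses right
  p_single = sum(comb(10, k) for k in (8, 9, 10)) / 2 ** 10
  print(round(p_single, 4))          # 0.0547

  # Probability that at least one of 50 independent analysts does so
  p_any_of_50 = 1 - (1 - p_single) ** 50
  print(round(p_any_of_50, 4))       # 0.9399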
Effect of Multiple Comparison Procedure

• Many algorithms employ the following greedy strategy:
  Initial model: M
  Alternative model: M' = M ∪ γ, where γ is a component to be added to the model (e.g., a test condition of a decision tree)
  Keep M' if the improvement Δ(M, M') > α

• Often, γ is chosen from a set of alternative components Γ = {γ1, γ2, …, γk}

• If many alternatives are available, one may inadvertently add irrelevant components to the model, resulting in model overfitting
Notes on Overfitting

• Overfitting results in decision trees that are more complex than necessary

• The training error no longer provides a good estimate of how well the tree will perform on previously unseen records

• We need new ways of estimating generalization errors
Incorporating Model Complexity

• Rationale: Occam's Razor
  Given two models with similar generalization errors, one should prefer the simpler model over the more complex model
  A complex model has a greater chance of being fitted accidentally by errors in the data
  Therefore, one should include model complexity when evaluating a model
Minimum Description Length (MDL)

(Figure: records X1 … Xn with known labels y on one side, the same records with unknown labels on the other, and candidate decision trees with test nodes A?, B?, C? and leaves 0/1.)

• Cost(Model, Data) = Cost(Data | Model) + Cost(Model)
  Cost is the number of bits needed for encoding
  Search for the least costly model
• Cost(Data | Model) encodes the misclassification errors
• Cost(Model) uses node encoding (number of children) plus splitting-condition encoding
Estimating Statistical Bounds

• Estimate the generalization error with a statistical correction of the training error; because the generalization error is usually larger, the correction typically computes an upper bound on the training error

• Notation:
  e            error rate at the node, e = k / N
  x            true error rate
  N            total number of training records at the node
  k            number of misclassified records
  \alpha       confidence level
  z_{\alpha/2} standardized value from the standard normal distribution
Estimating Statistical Bounds

• With N training records, the probability that k of them are misclassified follows a binomial distribution:

  p(k, N) = \binom{N}{k} x^k (1 - x)^{N - k}

• The binomial distribution can be approximated by a normal distribution with

  \mu = N x, \quad \sigma^2 = N x (1 - x), \quad \text{i.e.} \quad k \sim N(Nx,\ Nx(1 - x))
Estimating Statistical Bounds

• Standardizing the normal approximation of the binomial distribution:

  \frac{k - Nx}{\sqrt{Nx(1 - x)}} \sim N(0, 1)

  so, at confidence level \alpha,

  \frac{k - Nx}{\sqrt{Nx(1 - x)}} \le z_{\alpha/2}

  and, since k = Ne,

  Ne - Nx \le z_{\alpha/2} \sqrt{Nx(1 - x)}
Estimating Statistical Bounds

• Squaring both sides of Ne - Nx \le z_{\alpha/2} \sqrt{Nx(1 - x)} gives a quadratic inequality in the true error rate x:

  (N + z_{\alpha/2}^2)\, x^2 - (2Ne + z_{\alpha/2}^2)\, x + Ne^2 \le 0

• Solving the quadratic for its larger root gives the upper bound:

  e_{upper}(N, e, \alpha) = \frac{ e + \frac{z_{\alpha/2}^2}{2N} + z_{\alpha/2} \sqrt{ \frac{e(1-e)}{N} + \frac{z_{\alpha/2}^2}{4N^2} } }{ 1 + \frac{z_{\alpha/2}^2}{N} }
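A small numeric sketch of the bound (scipy's norm.ppf is used to obtain z_{\alpha/2}; the leaf in the example, with 7 of 30 records misclassified, is hypothetical):

  from math import sqrt
  from scipy.stats import norm

  def e_upper(N, e, alpha=0.25):
      """Upper bound on the true error rate given training error e over N records,
      using the normal approximation derived above."""
      z = norm.ppf(1 - alpha / 2)          # z_{alpha/2}
      num = e + z**2 / (2 * N) + z * sqrt(e * (1 - e) / N + z**2 / (4 * N**2))
      return num / (1 + z**2 / N)

  # Hypothetical leaf: 7 of 30 training records misclassified (e = 7/30)
  print(e_upper(N=30, e=7/30, alpha=0.25))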
Using a Validation Set

• Divide the training data into two parts:
  Training set: use for model building
  Validation set: use for estimating the generalization error
  Note: the validation set is not the same as the test set

• Drawback:
  Less data is available for training
Handling Overfitting in Decision Trees

• Pre-Pruning (early stopping rule)
  Stop the algorithm before it becomes a fully-grown tree
  Typical stopping conditions for a node:
    Stop if all instances belong to the same class
    Stop if all the attribute values are the same
  More restrictive conditions:
    Stop if the number of instances is less than some user-specified threshold
    Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test)
    Stop if expanding the current node does not improve the impurity measures (e.g., Gini or information gain)
    Stop if the estimated generalization error falls below a certain threshold
Handling Overfitting in Decision Trees

• Post-pruning
  Grow the decision tree to its entirety
  Subtree replacement
    Trim the nodes of the decision tree in a bottom-up fashion
    If the generalization error improves after trimming, replace the subtree with a leaf node
    The class label of the leaf node is determined from the majority class of the instances in the subtree
  Subtree raising
    Replace a subtree with its most frequently used branch
Example of Post-Pruning
Class = Yes 20
Training Error (Before splitting) = 10/30
Pessimistic error = (10 + 0 5)/30 = 10 5/30Class = Yes 20
Class = No 10
Error = 10/30
Pessimistic error (10 + 0.5)/30 10.5/30
Training Error (After splitting) = 9/30
Pessimistic error (After splitting)
A?
Error = 10/30 ess st c e o ( te sp tt g)
= (9 + 4 × 0.5)/30 = 11/30
PRUNE!A?
A1
A2 A3
A4
Class = Yes 8Class = No 4
Class = Yes 3Class = No 4
Class = Yes 4Class = No 1
Class = Yes 5Class = No 1Class No 4 Class No 4 Class No 1 Class No 1
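A quick sketch reproducing the pruning decision above, assuming (as on the slide) a penalty of 0.5 per leaf; the helper name is mine:

  def pessimistic_error(leaf_counts, penalty=0.5):
      """Pessimistic error of a (sub)tree: training errors plus a penalty per leaf,
      divided by the total number of records."""
      n = sum(sum(c) for c in leaf_counts)
      errors = sum(sum(c) - max(c) for c in leaf_counts)
      return (errors + penalty * len(leaf_counts)) / n

  before = [[20, 10]]                          # unsplit node: Yes = 20, No = 10
  after = [[8, 4], [3, 4], [4, 1], [5, 1]]     # children A1..A4
  print(pessimistic_error(before))             # 10.5/30 = 0.35
  print(pessimistic_error(after))              # 11/30 ~= 0.367 -> prune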
Examples of Post-pruning
Handling Missing Attribute Values

• Missing values affect decision tree construction in three different ways:
  They affect how the impurity measures are computed
  They affect how instances with missing values are distributed to the child nodes
  They affect how a test instance with a missing value is classified
Computing the Impurity Measure (ignoring the record with the missing value)

  Tid  Refund  Marital Status  Taxable Income  Class
   1   Yes     Single          125K            No
   2   No      Married         100K            No
   3   No      Single          70K             No
   4   Yes     Married         120K            No
   5   No      Divorced        95K             Yes
   6   No      Married         60K             No
   7   Yes     Divorced        220K            No
   8   No      Single          85K             Yes
   9   No      Married         75K             No
  10   ?       Single          90K             Yes      <- missing value

Before splitting: Entropy(Parent) = -0.3 log(0.3) - 0.7 log(0.7) = 0.8813

  Counts          Class = Yes  Class = No
  Refund = Yes    0            3
  Refund = No     2            4
  Refund = ?      1            0

Split on Refund (ignoring the record with the missing value):

  Entropy(Refund = Yes) = 0
  Entropy(Refund = No)  = -(2/6) log(2/6) - (4/6) log(4/6) = 0.9183
  Entropy(Children)     = 0.3 (0) + 0.6 (0.9183) = 0.551

  Gain = 0.9 × (0.8813 - 0.551) ≈ 0.297, where 0.9 is the fraction of records with a known Refund value (the unscaled difference is 0.3303)
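A quick numeric check of these entropy values (helper and variable names are mine, not from the course):

  from math import log2

  def entropy(counts):
      n = sum(counts)
      return -sum(c / n * log2(c / n) for c in counts if c)

  parent = entropy([3, 7])                       # all 10 records: 3 Yes, 7 No -> ~0.8813

  # Split on Refund, ignoring the record whose Refund value is missing:
  # Refund=Yes -> (0 Yes, 3 No), Refund=No -> (2 Yes, 4 No), weighted 3/10 and 6/10 as on the slide
  children = 0.3 * entropy([0, 3]) + 0.6 * entropy([2, 4])   # ~0.551
  print(round(parent, 4), round(children, 4))
  print(round(parent - children, 4))             # unscaled difference, ~0.3303
  print(round(0.9 * (parent - children), 4))     # scaled by the fraction of known Refund values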
Distribute Instances (send the missing-value record to both children, according to the observed distribution)

  Records 1-9 (Refund known) are sent down the Refund split as usual:

    Refund = Yes:  Class = Yes: 0, Class = No: 3
    Refund = No:   Class = Yes: 2, Class = No: 4

  Record 10 (Refund = ?, Single, 90K, Class = Yes) is distributed to both children:

    Probability that Refund = Yes is 3/9
    Probability that Refund = No is 6/9

    Assign the record to the left child (Refund = Yes) with weight 3/9 and to the right child (Refund = No) with weight 6/9:

    Refund = Yes:  Class = Yes: 0 + 3/9, Class = No: 3
    Refund = No:   Class = Yes: 2 + 6/9, Class = No: 4
Classify Instances

New record:  Tid 11, Refund = No, Marital Status = ?, Taxable Income = 85K, Class = ?

  Tree: Refund? (Yes -> NO; No -> MarSt?); MarSt? (Married -> NO; Single, Divorced -> TaxInc?); TaxInc? (< 80K -> NO; > 80K -> YES)

  Fractional counts at the Marital Status node:

                 Married  Single  Divorced  Total
    Class = No   3        1       0         4
    Class = Yes  6/9      1       1         2.67
    Total        3.67     2       1         6.67

  Probability that Marital Status = Married is 3.67/6.67
  Probability that Marital Status = {Single, Divorced} is 3/6.67
Other Issues

• Data Fragmentation
• Search Strategy
• Expressiveness
• Tree Replication
Data Fragmentation

• The number of instances gets smaller as you traverse down the tree

• The number of instances at the leaf nodes could be too small to make any statistically significant decision
Search Strategy

• Finding an optimal decision tree is NP-hard

• The algorithm presented so far uses a greedy, top-down, recursive partitioning strategy to induce a reasonable solution

• Other strategies?
  Bottom-up
  Bi-directional
Expressiveness

• Decision trees provide an expressive representation for learning discrete-valued functions
  But they do not generalize well to certain types of Boolean functions
  Example: the parity function
    - Class = 1 if there is an even number of Boolean attributes with truth value = True
    - Class = 0 if there is an odd number of Boolean attributes with truth value = True
  For accurate modeling, the tree must be a complete tree

• Not expressive enough for modeling continuous variables
  Particularly when the test condition involves only a single attribute at a time
Decision Boundary

• The border line between two neighboring regions of different classes is known as the decision boundary

• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
Oblique Decision Trees

  Example test condition: x + y < 1 (one class on each side of the line)

• The test condition may involve multiple attributes
• More expressive representation
• Finding the optimal test condition is computationally expensive
Tree Replication

(Figure: a decision tree with root P in which the same subtree, containing the tests Q and S, appears in more than one branch.)

• The same subtree can appear in multiple branches