Bab 07 - Class & Predict - Part 1.ppt



    Data Mining

    Chapter 7, Part 1

    Classification & Prediction

    Classification vs. Prediction

    Classification: predicts categorical class labels
      Classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses it to classify new data
    Prediction: models continuous-valued functions, i.e., predicts unknown or missing values
    Typical applications
      Credit approval
      Target marketing
      Medical diagnosis
      Treatment effectiveness analysis

    Classification: Definition

    Given a collection of records (training set)
      Each record contains a set of attributes; one of the attributes is the class.
    Find a model for the class attribute as a function of the values of the other attributes.
    Goal: previously unseen records should be assigned a class as accurately as possible.
      A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

    Illustrating Classification Task

    Training Set -> Learning algorithm -> Induction (Learn Model) -> Model
    Test Set -> Apply Model -> Deduction

    Training Set:
    Tid  Attrib1  Attrib2  Attrib3  Class
    1    Yes      Large    125K     No
    2    No       Medium   100K     No
    3    No       Small    70K      No
    4    Yes      Medium   120K     No
    5    No       Large    95K      Yes
    6    No       Medium   60K      No
    7    Yes      Large    220K     No
    8    No       Small    85K      Yes
    9    No       Medium   75K      No
    10   No       Small    90K      Yes

    Test Set:
    Tid  Attrib1  Attrib2  Attrib3  Class
    11   No       Small    55K      ?
    12   Yes      Medium   80K      ?
    13   Yes      Large    110K     ?
    14   No       Small    95K      ?
    15   No       Large    67K      ?

    Examples of Classification Task

    Predicting tumor cells as benign or malignant
    Classifying credit card transactions as legitimate or fraudulent
    Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
    Categorizing news stories as finance, weather, entertainment, sports, etc.

    Classification: A Two-Step Process

    Model construction (induction): describing a set of predetermined classes
      Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
      The set of tuples used for model construction: the training set
      The model is represented as classification rules, decision trees, or mathematical formulae
    Model usage (deduction): classifying future or unknown objects
      Estimate the accuracy of the model
        The known label of each test sample is compared with the classified result from the model
        The accuracy rate is the percentage of test set samples that are correctly classified by the model
        The test set must be independent of the training set, otherwise over-fitting will occur
    (A minimal code sketch of this two-step process follows.)
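    The following is a minimal, illustrative sketch of the two-step process in Python using scikit-learn; the slides do not prescribe any tool, so the library choice, the bundled iris data set, and the 70/30 split are assumptions made only for this example.

        # Hypothetical illustration of induction (step 1) and deduction (step 2);
        # not part of the original slides.
        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.metrics import accuracy_score

        X, y = load_iris(return_X_y=True)

        # Step 1: model construction (induction) on the training set
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
        model = DecisionTreeClassifier().fit(X_train, y_train)

        # Step 2: model usage (deduction) -- estimate accuracy on the independent test set
        y_pred = model.predict(X_test)
        print("accuracy:", accuracy_score(y_test, y_pred))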

    Supervised vs. Unsupervised Learning

    Supervised learning (classification)
      Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
      New data are classified based on the training set
    Unsupervised learning (clustering)
      The class labels of the training data are unknown
      Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data


    Issues regarding classification and prediction: (1) Data Preparation

    Data cleaning
      Preprocess data in order to reduce noise and handle missing values
    Relevance analysis (feature selection)
      Remove irrelevant or redundant attributes
    Data transformation
      Generalize and/or normalize data

    Issues regarding classification and prediction: (2) Evaluating Classification Methods

    Predictive accuracy
    Speed
      Time to construct the model
      Time to use the model
    Robustness
      Handling noise and missing values
    Scalability
      Efficiency in disk-resident databases

    Classification by Decision Tree Induction

    Decision tree
      A flow-chart-like tree structure
      An internal node denotes a test on an attribute
      A branch represents an outcome of the test
      Leaf nodes represent class labels or class distributions
    Decision tree generation consists of two phases
      Tree construction
        At the start, all the training examples are at the root
        Partition the examples recursively based on selected attributes
      Tree pruning
        Identify and remove branches that reflect noise or outliers
    Use of a decision tree: classifying an unknown sample
      Test the attribute values of the sample against the decision tree

    Example of a Decision Tree

    Training Data:

    Tid  Refund  Marital Status  Taxable Income  Cheat
    1    Yes     Single          125K            No
    2    No      Married         100K            No
    3    No      Single          70K             No
    4    Yes     Married         120K            No
    5    No      Divorced        95K             Yes
    6    No      Married         60K             No
    7    Yes     Divorced        220K            No
    8    No      Single          85K             Yes
    9    No      Married         75K             No
    10   No      Single          90K             Yes

    Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)
      Refund = Yes -> NO
      Refund = No  -> test MarSt
        MarSt in {Single, Divorced} -> test TaxInc
          TaxInc < 80K  -> NO
          TaxInc >= 80K -> YES
        MarSt = Married -> NO

    (A small code sketch of this tree follows.)
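    The tree above written out as nested if/else tests; the function name and the dict-based record format are our own illustration, not something defined on the slides.

        # Hand-coded sketch of the example decision tree (not produced by any algorithm here).
        def classify_cheat(record):
            """Return the predicted 'Cheat' label for one record given as a dict."""
            if record["Refund"] == "Yes":
                return "No"
            # Refund == "No": test marital status next
            if record["MaritalStatus"] in ("Single", "Divorced"):
                return "Yes" if record["TaxableIncome"] >= 80 else "No"
            # MaritalStatus == "Married"
            return "No"

        # The test record used later in "Apply Model to Test Data":
        print(classify_cheat({"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}))  # No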

    Another Example of Decision Tree

    (Same training data as on the previous slide.)

    Model: Decision Tree
      MarSt = Married -> NO
      MarSt in {Single, Divorced} -> test Refund
        Refund = Yes -> NO
        Refund = No  -> test TaxInc
          TaxInc < 80K  -> NO
          TaxInc >= 80K -> YES

    There could be more than one tree that fits the same data!

    Decision Tree Classification Task

    (Same training and test sets as in "Illustrating Classification Task" above.)
    Training Set -> Tree Induction algorithm -> Induction (Learn Model) -> Decision Tree
    Test Set -> Apply Model (Deduction) using the learned Decision Tree

    Apply Model to Test Data (1)-(6)

    Test Data:

    Refund  Marital Status  Taxable Income  Cheat
    No      Married         80K             ?

    Start from the root of the tree and test the record's attribute values at each internal node:
      Refund = No     -> follow the "No" branch to the MarSt node
      MarSt = Married -> follow the "Married" branch to a leaf labeled NO
    Assign Cheat to "No"

    Decision Tree Classification Task (recap)

    (Same diagram as before: Training Set -> Tree Induction algorithm -> Decision Tree; Test Set -> Apply Model / Deduction.)

    Algorithm for Decision Tree Induction

    Basic algorithm
      The tree is constructed in a top-down recursive manner
      At the start, all the training examples are at the root
      Attributes are categorical (if continuous-valued, they are discretized in advance)
      Examples are partitioned recursively based on selected attributes
      Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
    Conditions for stopping partitioning
      All samples for a given node belong to the same class
      There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
      There are no samples left

    Decision Tree Induction

    Many algorithms exist, for example:
      Hunt's Algorithm (one of the earliest)
      CART
      ID3, C4.5
      SLIQ, SPRINT

    General Structure of Hunt's Algorithm

    Let Dt be the set of training records that reach a node t.
    General procedure:
      If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
      If Dt is an empty set, then t is a leaf node labeled by the default class, yd
      If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset
    (Illustrated on the Refund / Marital Status / Taxable Income / Cheat training data shown earlier; a code sketch of the procedure follows.)
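    A minimal, illustrative sketch of the general procedure in Python. The split-selection rule used here (the first attribute that still has more than one distinct value) is only a placeholder for the impurity measures discussed later, and all names are ours rather than part of any library.

        # Sketch of Hunt's general procedure on a list of dict records with a "Class" key.
        from collections import Counter

        def hunt(records, attributes, default_class):
            """Return a nested-dict decision tree (leaves are class labels)."""
            if not records:                              # Dt is empty -> default class yd
                return default_class
            classes = Counter(r["Class"] for r in records)
            if len(classes) == 1:                        # all records belong to one class yt
                return next(iter(classes))
            majority = classes.most_common(1)[0][0]
            for attr in attributes:                      # placeholder split selection
                values = {r[attr] for r in records}
                if len(values) > 1:
                    node = {"attr": attr}
                    for v in values:
                        subset = [r for r in records if r[attr] == v]
                        node[v] = hunt(subset, [a for a in attributes if a != attr], majority)
                    return node
            return majority                              # no attribute left -> majority vote

        # Tiny made-up subset of the slide's data, just to exercise the sketch:
        records = [
            {"Refund": "Yes", "MaritalStatus": "Single",  "Class": "No"},
            {"Refund": "No",  "MaritalStatus": "Married", "Class": "No"},
            {"Refund": "No",  "MaritalStatus": "Single",  "Class": "Yes"},
        ]
        print(hunt(records, ["Refund", "MaritalStatus"], default_class="No"))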

    Hunt's Algorithm (applied to the Refund / Marital Status / Taxable Income training data)

    Step 1: a single node labeled with the default class
      Don't Cheat
    Step 2: split on Refund
      Refund = Yes -> Don't Cheat
      Refund = No  -> Don't Cheat
    Step 3: split the Refund = No branch on Marital Status
      Refund = Yes -> Don't Cheat
      Refund = No  -> Marital Status: {Single, Divorced} -> Cheat;  Married -> Don't Cheat
    Step 4: split the {Single, Divorced} branch on Taxable Income
      Refund = Yes -> Don't Cheat
      Refund = No  -> Marital Status: Married -> Don't Cheat;
                      {Single, Divorced} -> Taxable Income: < 80K -> Don't Cheat;  >= 80K -> Cheat

    Tree Induction

    Greedy strategy
      Split the records based on an attribute test that optimizes a certain criterion
    Issues
      Determine how to split the records
        How to specify the attribute test condition?
        How to determine the best split?
      Determine when to stop splitting


    How to Specify Test Condition?

    Depends on attribute types

    Nominal

    Ordinal

    Continuous

    Depends on number of ways to split

    2-way split

    Multi-way split

    Splitting Based on Nominal Attributes

    Multi-way split: use as many partitions as there are distinct values.
      CarType -> {Family}, {Sports}, {Luxury}
    Binary split: divides values into two subsets; need to find the optimal partitioning.
      CarType -> {Sports, Luxury} vs. {Family}   OR   CarType -> {Family, Luxury} vs. {Sports}

    Splitting Based on Ordinal Attributes

    Multi-way split: use as many partitions as there are distinct values.
      Size -> {Small}, {Medium}, {Large}
    Binary split: divides values into two subsets; need to find the optimal partitioning.
      Size -> {Small, Medium} vs. {Large}   OR   Size -> {Medium, Large} vs. {Small}
    What about this split?  Size -> {Small, Large} vs. {Medium}
      (It does not preserve the order of the attribute values.)

    Splitting Based on Continuous Attributes (1)

    Different ways of handling
      Discretization to form an ordinal categorical attribute
        Static: discretize once at the beginning
        Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
      Binary decision: (A < v) or (A >= v)
        Consider all possible splits and find the best cut
        Can be more compute-intensive

    Splitting Based on Continuous Attributes (2)

    (i) Binary split: Taxable Income > 80K?  (Yes / No)
    (ii) Multi-way split: Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K

    Tree Induction

    Greedy strategy
      Split the records based on an attribute test that optimizes a certain criterion
    Issues
      Determine how to split the records
        How to specify the attribute test condition?
        How to determine the best split?
      Determine when to stop splitting

    How to determine the Best Split (1)

    Before splitting: 10 records of class 0 and 10 records of class 1.

    Candidate splits (class counts C0 : C1 in each child node):
      Own Car?     Yes -> 6 : 4      No -> 4 : 6
      Car Type?    Family -> 1 : 3   Sports -> 8 : 0   Luxury -> 1 : 7
      Student ID?  c1 -> 1 : 0, ..., c10 -> 1 : 0, c11 -> 0 : 1, ..., c20 -> 0 : 1

    Which test condition is the best?

    How to determine the Best Split (2)

    Greedy approach:
      Nodes with a homogeneous class distribution are preferred
    Need a measure of node impurity:
      C0: 5, C1: 5 -> non-homogeneous, high degree of impurity
      C0: 9, C1: 1 -> homogeneous, low degree of impurity


    Measures of Node Impurity

    Gini Index

    Entropy

    Misclassification error

    How to Find the Best Split

    Before splitting: the parent node has class counts C0 = N00 and C1 = N01, with impurity M0.
    Candidate split A? produces nodes N1 (N10, N11) and N2 (N20, N21), with impurities M1 and M2 and weighted child impurity M12.
    Candidate split B? produces nodes N3 (N30, N31) and N4 (N40, N41), with impurities M3 and M4 and weighted child impurity M34.
    Gain = M0 - M12 vs. M0 - M34: choose the split with the larger gain, i.e., the lower weighted child impurity.

    Measure of Impurity: GINI

    Gini index for a given node t:

      GINI(t) = 1 - \sum_j [p(j|t)]^2

      (NOTE: p(j|t) is the relative frequency of class j at node t.)

    Maximum (1 - 1/n_c, where n_c is the number of classes) when records are equally distributed among all classes, implying the least interesting information
    Minimum (0.0) when all records belong to one class, implying the most interesting information

    C1: 0, C2: 6 -> Gini = 0.000
    C1: 1, C2: 5 -> Gini = 0.278
    C1: 2, C2: 4 -> Gini = 0.444
    C1: 3, C2: 3 -> Gini = 0.500

    Examples for computing GINI

      GINI(t) = 1 - \sum_j [p(j|t)]^2

    C1: 0, C2: 6
      P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
      Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0

    C1: 1, C2: 5
      P(C1) = 1/6,  P(C2) = 5/6
      Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278

    C1: 2, C2: 4
      P(C1) = 2/6,  P(C2) = 4/6
      Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444

    (A small code sketch that reproduces these values follows.)
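    A small sketch that reproduces the Gini values above; the helper name gini is ours.

        # Gini index of a node from its class counts: GINI(t) = 1 - sum_j p(j|t)^2
        def gini(counts):
            n = sum(counts)
            return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

        print(gini([0, 6]))  # 0.0
        print(gini([1, 5]))  # 0.277...
        print(gini([2, 4]))  # 0.444...
        print(gini([3, 3]))  # 0.5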

    Splitting Based on GINI

    Used in CART, SLIQ, SPRINT.
    When a node p is split into k partitions (children), the quality of the split is computed as

      GINI_split = \sum_{i=1}^{k} (n_i / n) GINI(i)

    where n_i = number of records at child i, and n = number of records at node p.

    Binary Attributes: Computing GINI Index

    Splits into two partitions
    Effect of weighting partitions: larger and purer partitions are sought

    Parent: C1 = 6, C2 = 6, Gini = 0.500
    Split on B?
      Node N1 (Yes): C1 = 5, C2 = 2
      Node N2 (No):  C1 = 1, C2 = 4

      Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
      Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
      Gini(children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371

    (A code sketch of this weighted computation follows.)
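    A short sketch of the weighted split Gini that reproduces the 0.371 above; the helper names are ours.

        # Weighted Gini of a split: GINI_split = sum_i (n_i / n) * GINI(i)
        def gini(counts):
            n = sum(counts)
            return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

        def gini_split(children_counts):
            """children_counts: one list of class counts per child node."""
            n = sum(sum(c) for c in children_counts)
            return sum(sum(c) / n * gini(c) for c in children_counts)

        # Parent [6, 6] split into N1 = [5, 2] and N2 = [1, 4]
        print(gini_split([[5, 2], [1, 4]]))  # ~0.371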

    Categorical Attributes: Computing Gini Index

    For each distinct value, gather counts for each class in the dataset
    Use the count matrix to make decisions

    Multi-way split:
      CarType   Family  Sports  Luxury
      C1        1       2       1
      C2        4       1       1
      Gini = 0.393

    Two-way split (find the best partition of values):
      CarType   {Sports, Luxury}  {Family}
      C1        3                 1
      C2        2                 4
      Gini = 0.400

      CarType   {Sports}  {Family, Luxury}
      C1        2         2
      C2        1         5
      Gini = 0.419

    Continuous Attributes: Computing Gini Index

    Use binary decisions based on one value
    Several choices for the splitting value
      Number of possible splitting values = number of distinct values
    Each splitting value v has a count matrix associated with it
      Class counts in each of the partitions, A < v and A >= v
    Simple method to choose the best v
      For each v, scan the database to gather the count matrix and compute its Gini index
      Computationally inefficient! Repetition of work.
    (Example: the split Taxable Income > 80K? on the Refund / Marital Status / Taxable Income / Cheat training data shown earlier.)

    Continuous Attributes: Computing Gini Index...

    For efficient computation, for each attribute:
      Sort the attribute values
      Linearly scan these values, each time updating the count matrix and computing the Gini index
      Choose the split position that has the least Gini index

    Sorted values (Taxable Income): 60  70  75  85  90  95  100  120  125  220
    Class (Cheat):                  No  No  No  Yes Yes Yes No   No   No   No

    Split positions v, class counts (<= v , > v), and Gini of the split:

      v      Yes     No      Gini
      55     0 , 3   0 , 7   0.420
      65     0 , 3   1 , 6   0.400
      72     0 , 3   2 , 5   0.375
      80     0 , 3   3 , 4   0.343
      87     1 , 2   3 , 4   0.417
      92     2 , 1   3 , 4   0.400
      97     3 , 0   3 , 4   0.300   <- least Gini: best split
      110    3 , 0   4 , 3   0.343
      122    3 , 0   5 , 2   0.375
      172    3 , 0   6 , 1   0.400
      230    3 , 0   7 , 0   0.420

    (A code sketch of this scan follows.)
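    A sketch of evaluating the candidate splits on the same ten Taxable Income values. The incremental count-matrix update described above is omitted for brevity (counts are simply recomputed at each cut), the candidate positions are exact midpoints rather than the rounded values on the slide, and all names are ours.

        # Evaluate the split Gini of every candidate cut on Taxable Income and keep the best.
        incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]                     # Tid 1..10
        cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]  # class labels

        pairs = sorted(zip(incomes, cheat))          # 1. sort on the attribute value
        values = [v for v, _ in pairs]
        labels = [c for _, c in pairs]

        def gini(counts):
            n = sum(counts)
            return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

        def class_counts(part):
            return [part.count("Yes"), part.count("No")]

        # 2. candidate cuts: midpoints between consecutive sorted values, plus both ends
        candidates = ([values[0] - 5]
                      + [(a + b) / 2 for a, b in zip(values, values[1:])]
                      + [values[-1] + 10])

        best = None
        for v in candidates:                         # 3. one weighted Gini per cut
            left  = [labels[i] for i, x in enumerate(values) if x <= v]
            right = [labels[i] for i, x in enumerate(values) if x > v]
            g = (len(left) * gini(class_counts(left))
                 + len(right) * gini(class_counts(right))) / len(values)
            if best is None or g < best[1]:
                best = (v, g)
        print(best)                                  # (97.5, ~0.30): the best cut lies between 95 and 100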

    Alternative Splitting Criteria based on INFO

    Entropy at a given node t:

      Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

      (NOTE: p(j|t) is the relative frequency of class j at node t.)

    Measures the homogeneity of a node
      Maximum (\log_2 n_c) when records are equally distributed among all classes, implying the least information
      Minimum (0.0) when all records belong to one class, implying the most information
    Entropy-based computations are similar to the GINI index computations

    Examples for computing Entropy

      Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

    C1: 0, C2: 6
      P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
      Entropy = -0 log 0 - 1 log 1 = -0 - 0 = 0

    C1: 1, C2: 5
      P(C1) = 1/6,  P(C2) = 5/6
      Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65

    C1: 2, C2: 4
      P(C1) = 2/6,  P(C2) = 4/6
      Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92

    (A small code sketch that reproduces these values follows.)
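    A small sketch that reproduces the entropy values above; the helper name is ours.

        # Entropy of a node from its class counts: Entropy(t) = -sum_j p(j|t) log2 p(j|t)
        from math import log2

        def entropy(counts):
            n = sum(counts)
            return -sum((c / n) * log2(c / n) for c in counts if c > 0)

        print(entropy([0, 6]))  # 0.0
        print(entropy([1, 5]))  # 0.650...
        print(entropy([2, 4]))  # 0.918...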

    Splitting Based on INFO...

    Information Gain:

      GAIN_split = Entropy(p) - \sum_{i=1}^{k} (n_i / n) Entropy(i)

      (Parent node p is split into k partitions; n_i is the number of records in partition i.)

    Measures the reduction in entropy achieved because of the split; choose the split that achieves the most reduction (maximizes GAIN)
    Used in ID3 and C4.5
    Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure

    [Training data table, only partially recoverable: attributes age, income, student and credit_rating, plus the class attribute buys_computer used in the information-gain computation on the next slide.]

    Attribute Selection by Information Gain Computation

    Class P: buys_computer = "yes"
    Class N: buys_computer = "no"
    I(p, n) = I(9, 5) = 0.940

    Compute the entropy for age:

      age       p_i  n_i  I(p_i, n_i)
      <= 30     2    3    0.971
      31...40   4    0    0
      > 40      3    2    0.971

      E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.69

    Hence
      Gain(age) = I(p, n) - E(age) = 0.94 - 0.69 = 0.25

    Similarly
      Gain(income) = 0.029
      Gain(student) = 0.151
      Gain(credit_rating) = 0.048

    (A short code check of this computation follows.)
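    A short check of the computation above; I() follows the slide's notation, and values are rounded as on the slide.

        # Expected information I(p, n) of a node with p positive and n negative tuples.
        from math import log2

        def I(p, n):
            total = p + n
            return -sum((x / total) * log2(x / total) for x in (p, n) if x > 0)

        print(round(I(9, 5), 3))                                  # 0.940
        e_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)
        print(round(e_age, 3))                                    # 0.694 (~0.69)
        print(round(I(9, 5) - e_age, 3))                          # 0.246 (~0.25)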

    Splitting Based on INFO: Gain Ratio

    Gain Ratio:

      GainRATIO_split = GAIN_split / SplitINFO

      SplitINFO = - \sum_{i=1}^{k} (n_i / n) \log_2 (n_i / n)

      (Parent node p is split into k partitions; n_i is the number of records in partition i.)

    Adjusts information gain by the entropy of the partitioning (SplitINFO); higher-entropy partitioning (a large number of small partitions) is penalized!
    Used in C4.5
    Designed to overcome the disadvantage of information gain

    (An illustrative code sketch follows.)
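    An illustrative sketch only: the slide gives just the formulas, so as an assumed example we apply them to the "age" split from the previous slide (partition sizes 5, 4 and 5 out of 14, gain 0.246); the resulting numbers are ours, not from the slides.

        # Gain ratio = information gain / entropy of the partitioning itself.
        from math import log2

        def split_info(sizes):
            n = sum(sizes)
            return -sum((s / n) * log2(s / n) for s in sizes if s > 0)

        gain_age = 0.246                   # information gain of "age", computed earlier
        si = split_info([5, 4, 5])         # SplitINFO of the three age partitions
        print(round(si, 3))                # ~1.577
        print(round(gain_age / si, 3))     # gain ratio ~0.156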

    Splitting Criteria based on Classification Error

    Classification error at a node t:

      Error(t) = 1 - max_i P(i|t)

    Measures the misclassification error made by a node
      Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying the least interesting information
      Minimum (0.0) when all records belong to one class, implying the most interesting information

    Examples for Computing Error

      Error(t) = 1 - max_i P(i|t)

    C1: 0, C2: 6
      P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
      Error = 1 - max(0, 1) = 1 - 1 = 0

    C1: 1, C2: 5
      P(C1) = 1/6,  P(C2) = 5/6
      Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6

    C1: 2, C2: 4
      P(C1) = 2/6,  P(C2) = 4/6
      Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

    (A small code sketch that reproduces these values follows.)
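    A small sketch that reproduces the classification-error values above; the helper name is ours.

        # Classification error of a node from its class counts: Error(t) = 1 - max_i P(i|t)
        def classification_error(counts):
            n = sum(counts)
            return 1.0 - max(counts) / n if n else 0.0

        print(classification_error([0, 6]))  # 0.0
        print(classification_error([1, 5]))  # 0.1666... (1/6)
        print(classification_error([2, 4]))  # 0.3333... (1/3)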

    End of Chapter 7, Part 1