SYS7002_HW

  • 7/27/2019 SYS7002_HW

    SYS 7002 Financial Engineering Homework

    Problem Description

    Situation

    Financial institutions hold historical information on the outcomes of previous customer loans. This information includes customers' demographic factors and their credit performance (good or bad), and it can be used to establish credit scores and risk profiles. The resulting risk models can then be used to predict the performance of potential customers based on their relevant characteristics.

    Goal

    The objective of credit scoring is to predict the probability that a borrower will or will not default on a credit line. The objective of this specific assignment is to answer the following:

    1) Calculate: (1) the log-odds score and (2) the naive Bayes score for the data set in the file "OWNAGERECDES.xls".

    2) Plot the ROC curve for each score.

    3) Which score is more discriminating?

    Approach

    Data

    The provided data is an Excel file containing seven categorical variables (six predictor variables and one response variable) and two hundred observations. The response variable indicates whether a default occurred ("bad") or not ("good"). There are one hundred good observations and one hundred bad observations. The six predictor variables are binary (0 or 1) and indicate housing status and age range. Table 1 summarizes the variable labels.

    Table 1 Categorical Variable Labels

    Analysis and Evidence

    Using conditional counts of the data, establish a value for the probability of good or bad conditional on

    the data for each of the six predictor variables.

    Pr(Good|data) = p(G|x); Pr(Bad|data) = p(B|x)

    G=1/B=0 Own Rent Age:

    Use the above probabilities to determine the posterior odds of a good outcome.

    o(G|x) = p(G|x) / p(B|x)

    In this case, since there are equal numbers of good and bad observations, the population odds are:

    p(G)/p(B) = 1

    From slide 28 of the lecture:

    [1]

    For this data set, because the population odds equal 1, the information odds, I(x), are equal to the posterior odds of a good, o(G|x).
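The relationship between these quantities can be written out as a short derivation. This is a sketch of the standard Bayes decomposition of the odds, not a copy of the lecture slide; the likelihood notation f(x|G) and f(x|B) and the symbol o_pop for the population odds are assumptions introduced here.

```latex
\begin{align*}
o(G \mid x) &= \frac{p(G \mid x)}{p(B \mid x)}
             = \frac{p(G)}{p(B)} \cdot \frac{f(x \mid G)}{f(x \mid B)}
             = o_{\mathrm{pop}} \, I(x), \\
o_{\mathrm{pop}} &= \frac{p(G)}{p(B)} = \frac{100}{100} = 1
  \quad\Longrightarrow\quad o(G \mid x) = I(x).
\end{align*}
```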

    The weight of evidence in favor of good, wi(xi), is used to weight each characteristic, xi, and establish a numerical value for the credit score, s(x).

    s(x) = w1*x1 + w2*x2 + ... + wn*xn

    This sum of weighted characteristics is the naive Bayes score.

    The first ten rows of the attached spreadsheet illustrate the data and the calculated naive Bayes scores.

    Opop x I(x)

    O(own) 2.40

    O(rent) 0.53

    O(

    Where the score was calculated as the sum of the products of each predictor variable and its log odds value:

    0(0.88) + 1(-0.63) + 1(-1.39) + 0(-0.69) + 0(1.10) + 0(1.39) = -2.014903

    The result shown uses the unrounded log odds values; the rounded values printed above sum to -2.02.
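The weighted sum above can be reproduced with a few lines of Python. This is a minimal sketch, not the spreadsheet's actual formula: the function name `naive_bayes_score` is hypothetical, the weights are the rounded values printed in the example, and the column order (own, rent, then four age bands) is an assumption based on the table headers.

```python
def naive_bayes_score(x, weights):
    """Sum of indicator values times their log-odds weights."""
    if len(x) != len(weights):
        raise ValueError("x and weights must have the same length")
    return sum(w * xi for w, xi in zip(weights, x))


# Rounded weights as printed in the worked example (assumed column order:
# own, rent, and four age bands).
weights = [0.88, -0.63, -1.39, -0.69, 1.10, 1.39]

# One observation: a renter in the first age band.
x = [0, 1, 1, 0, 0, 0]

score = naive_bayes_score(x, weights)
print(round(score, 2))
```

Because the weights here are rounded to two decimals, the result differs slightly from the spreadsheet value of -2.014903.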

    The log odds score is developed by using the counts (occurrences) for the conditionally independent

    characteristics age and residential status.

    These counts are used to generate a corresponding matrix of probabilities (fractions).

    The probabilities are used to generate a log odds score by taking the natural log of the posterior odds of

    a good.

    Log odds score = ln(o(G|x)) = ln(p(G|x) / p(B|x))
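The step from counts to a log odds score can be sketched as follows. The function name `log_odds_score` and the example counts are hypothetical and are not taken from the data set; the sketch only illustrates the formula ln(p(G|x) / p(B|x)) for a single cell of the count matrix.

```python
import math


def log_odds_score(n_good, n_bad):
    """Log odds of good for one cell, given its good and bad counts."""
    if n_good <= 0 or n_bad <= 0:
        raise ValueError("both counts must be positive to take the log of the odds")
    total = n_good + n_bad
    p_good = n_good / total  # p(G|x): fraction of goods in this cell
    p_bad = n_bad / total    # p(B|x): fraction of bads in this cell
    return math.log(p_good / p_bad)


# Hypothetical cell with 30 goods and 10 bads (illustrative counts only).
example = log_odds_score(30, 10)
print(example)
```

Cells with a zero count have undefined log odds, so the sketch raises an error rather than guessing a smoothing rule.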

    G=1/B=0 Own Rent Age:

    A summary of both credit scores versus each demographic category:

    To create a ROC curve for each credit score, the true positive and false positive rates were calculated. This process involved:

    - Examining the range of values for each credit score
    - Selecting thresholds that ensured a change in rate
    - Counting the false positive, true positive, false negative, and true negative occurrences for each threshold
    - Calculating the false positive and true positive rates
    - Plotting the TP and FP rates for each credit score
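The steps above can be sketched in Python. This is an illustrative sketch, not the spreadsheet procedure itself: the function name `roc_points` and the toy data are hypothetical, and it assumes that "good" (label 1) is the positive class and that an observation is classified as positive when its score is at or above the threshold.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, FP rate, TP rate, FP, TP, FN, TN) for each threshold.

    labels: 1 = good (positive), 0 = bad (negative).
    An observation is predicted positive when score >= threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rows = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = n_pos - tp
        tn = n_neg - fp
        rows.append((t, fp / n_neg, tp / n_pos, fp, tp, fn, tn))
    return rows


# Toy data for illustration only (not from the assignment data set).
toy_scores = [2.0, 1.0, -1.0, -2.0]
toy_labels = [1, 1, 0, 0]
rows = roc_points(toy_scores, toy_labels, thresholds=[3, 0, -3])
for row in rows:
    print(row)
```

Plotting the second column against the third for each threshold gives the ROC curve for that score.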

    The completed ROC curve is illustrated below along with the data tables. This is also in the attached

    spreadsheet.

    p(G|x)

    own/age

    Recommendation

    Based on the Summary of Scores table, the naive Bayes score offers higher discrimination for the demographic categories considered.

    threshold  LO_FP_rate  LO_TP_rate   FP   TP   FN   TN
     1.7       0           0             0    0  100  100
     1.1       0.05        0.25          5   25   75   95
     0         0.25        0.85         25   85   15   75
    -1.7       0.35        0.9          35   90   10   65
    -1.9       0.65        0.95         65   95    5   35
    -2         1           1           100  100    0    0

    log odds scores
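One way to put a number on how discriminating a score is would be the area under its ROC curve. The sketch below applies the trapezoid rule to the log odds (FP rate, TP rate) pairs from the table above. The area measure and the function name `trapezoid_auc` are additions for illustration and are not part of the original analysis.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve given (FP rate, TP rate) pairs."""
    pts = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# (LO_FP_rate, LO_TP_rate) pairs taken from the log odds table.
log_odds_roc = [
    (0, 0),
    (0.05, 0.25),
    (0.25, 0.85),
    (0.35, 0.9),
    (0.65, 0.95),
    (1, 1),
]

auc = trapezoid_auc(log_odds_roc)
print(auc)
```

Applying the same function to the naive Bayes ROC points would allow a like-for-like comparison of the two scores.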

    Bonus:

    naive

               bins                     cum
    score      score|good  score|bad    score|good  score|bad
    -2.1        0           0             0           0
    -2          5          30             5          30
    -1.3        5          35            10          65
    -0.5        5          10            15          75
     0.2       15           5            30          80
     0.5       15           5            45          85
     0.8       15           5            60          90
     2         15           5            75          95
     2.5       25           5           100         100
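The cumulative columns in this table are running totals of the bin counts. A minimal Python sketch of that step, using the bin counts listed above, is shown here; the variable names are hypothetical.

```python
from itertools import accumulate

# Bin counts per naive Bayes score, as listed in the table.
scores = [-2.1, -2, -1.3, -0.5, 0.2, 0.5, 0.8, 2, 2.5]
bin_good = [0, 5, 5, 5, 15, 15, 15, 15, 25]
bin_bad = [0, 30, 35, 10, 5, 5, 5, 5, 5]

# Running totals give the number of goods and bads at or below each score.
cum_good = list(accumulate(bin_good))
cum_bad = list(accumulate(bin_bad))

for s, g, b in zip(scores, cum_good, cum_bad):
    print(s, g, b)
```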

    [Figure: ROC Curves. True Positive Rate (vertical axis, 0 to 1) versus False Positive Rate (horizontal axis, 0 to 1); series: log odds, naive Bayes]