BA 682 Data Mining
20123820 강준현
I. Building a Logistic Regression Model with the Universal Bank Data
Data Exploration
[Figure panels: Income, Family Size, Credit Card Avg, Education]
Data Exploration
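Correlation matrix (lower triangle; columns follow the same order as the rows)
Columns: ID, Age, Experience, Income, ZIP Code, Family, CCAvg, Education, Mortgage, Personal Loan, Securities Account, CD Account, Online, CreditCard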
ID 1
Age -0.00847 1
Experience -0.00833 0.994215 1
Income -0.01769 -0.05527 -0.04657 1
ZIP Code 0.013432 -0.02922 -0.02863 -0.01641 1
Family -0.0168 -0.04642 -0.05256 -0.1575 0.011778 1
CCAvg -0.02467 -0.05203 -0.05009 0.645993 -0.00407 -0.10928 1
Education 0.021463 0.041334 0.013152 -0.18752 -0.01738 0.064929 -0.13614 1
Mortgage -0.01392 -0.01254 -0.01058 0.206806 0.007383 -0.02044 0.109909 -0.03333 1
Personal Loan -0.0248 -0.00773 -0.00741 0.502462 0.000107 0.061367 0.366891 0.136722 0.142095 1
Securities Account -0.01697 -0.00044 -0.00123 -0.00262 0.004704 0.019994 0.015087 -0.01081 -0.00541 0.021954 1
CD Account -0.00691 0.008043 0.010353 0.169738 0.019972 0.01411 0.136537 0.013934 0.089311 0.316355 0.317034 1
Online -0.00253 0.013702 0.013898 0.014206 0.01699 0.010354 -0.00362 -0.015 -0.00599 0.006278 0.012627 0.175880016 1
CreditCard 0.017028 0.007681 0.008967 -0.00239 0.007691 0.011588 -0.00669 -0.01101 -0.00723 0.002802 -0.01503 0.278644365 0.00421 1
Age and Experience are almost perfectly correlated (0.994215).
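As a minimal sketch of how such a correlation can be checked, the snippet below computes the Pearson coefficient for Age and Experience using only the 13 sample rows that appear on the Data Partition Sheet later in this deck. The `THRESHOLD` value and the `is_redundant` flag are illustrative assumptions, not part of the original analysis.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Age and Experience of the 13 sample rows listed on the Data Partition Sheet
age = [25, 35, 35, 37, 35, 34, 29, 38, 42, 46, 55, 56, 29]
experience = [1, 9, 8, 13, 10, 9, 5, 14, 18, 21, 28, 31, 5]

THRESHOLD = 0.9  # hypothetical cut-off for calling two predictors redundant
r = pearson(age, experience)
is_redundant = abs(r) > THRESHOLD
print(f"r(Age, Experience) = {r:.3f}, redundant: {is_redundant}")
```

On this small sample the coefficient is already very close to 1, in line with the value reported for the full data set.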
Data Dimension Reduction
Principal Components (columns are components 1 to 5)
Variable    1             2             3             4             5
Age         0.01554224    0.70662385    0.08264883    0.54202592    -0.44701025
Experience  0.01338275    0.70728117    -0.07694203   -0.54321295   0.44561642
Income      -0.99977607   0.02043895    0.00444289    0.00272685    0.00167583
Family      0.0039179     -0.0041944    0.98944801    -0.1447722    0.00090257
Education   0.00342416    0.00085297    0.09067582    0.6246289     0.77563143
Variance    2119.944092   261.3847961   1.2882899     0.89438522    0.53321916
Variance%   88.92215729   10.96392059   0.05403799    0.03751545    0.02236615
Cum%        88.92215729   99.88607788   99.94011688   99.97763062   99.99999237
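The Variance% and Cum% rows follow directly from the Variance row. The short sketch below recomputes them from the five reported component variances; the variable names are illustrative only.

```python
# Component variances as reported in the Principal Components table
variances = [2119.944092, 261.3847961, 1.2882899, 0.89438522, 0.53321916]

total = sum(variances)
share = [100 * v / total for v in variances]  # Variance%

cumulative = []  # Cum%
running = 0.0
for s in share:
    running += s
    cumulative.append(running)

for i, (s, c) in enumerate(zip(share, cumulative), start=1):
    print(f"Component {i}: {s:.4f}% of variance, {c:.4f}% cumulative")
```

The first component alone carries about 89% of the variance and loads almost entirely on Income, which is what one would expect if the components were extracted from unstandardized variables (an assumption, since the slides do not state the scaling).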
Data Processing
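Of Age and Experience, only Experience is kept as a predictor.
Records with negative Experience values are deleted (52 records).
The nominal variables ID and ZIP Code are also excluded from the predictors.
The data set is randomly partitioned 60:40 into a training set and a validation set.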
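The processing steps above can be sketched in plain Python as follows. The toy `records`, the function name `prepare_and_split`, and the use of Python's `random` module are assumptions for illustration; the original work was done in XLMiner, and its random partition (seed 12345) will not be reproduced by Python's generator.

```python
import random

def prepare_and_split(records, train_frac=0.6, seed=12345):
    """Drop invalid rows and unused columns, then split randomly."""
    dropped_columns = ("ID", "ZIP Code", "Age")
    clean = [
        {k: v for k, v in rec.items() if k not in dropped_columns}
        for rec in records
        if rec["Experience"] >= 0  # remove negative Experience values
    ]
    rng = random.Random(seed)  # fixed seed for a repeatable split
    rng.shuffle(clean)
    n_train = round(train_frac * len(clean))
    return clean[:n_train], clean[n_train:]

# Hypothetical toy records (not taken from the Universal Bank data)
records = [
    {"ID": i, "Age": 25 + i, "Experience": i - 2, "Income": 40 + 5 * i,
     "ZIP Code": 90000 + i, "Personal Loan": i % 2}
    for i in range(10)
]

train, valid = prepare_and_split(records)
print(len(train), "training rows,", len(valid), "validation rows")
```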
XLMiner: Data Partition Sheet (Ver: 12.5.3E)
Row Id.  ID  Age  Experience  Income  ZIP Code  Family  CCAvg  Education  Mortgage  Personal Loan  Securities Account  CD Account  Online  CreditCard
1 1 25 1 49 91107 4 1.60 1 0 0 1 0 0 0
4 4 35 9 100 94112 1 2.70 2 0 0 0 0 0 0
5 5 35 8 45 91330 4 1.00 2 0 0 0 0 0 1
6 6 37 13 29 92121 4 0.40 2 155 0 0 0 1 0
9 9 35 10 81 90089 3 0.60 2 104 0 0 0 1 0
10 10 34 9 180 93023 1 8.90 3 0 1 0 0 0 0
12 12 29 5 45 90277 3 0.10 2 0 0 0 0 1 0
17 17 38 14 130 95010 4 4.70 3 134 1 0 0 0 0
18 18 42 18 81 94305 4 2.40 1 0 0 0 0 0 0
19 19 46 21 193 91604 2 8.10 3 0 1 0 0 0 0
20 20 55 28 21 94720 1 0.50 2 0 0 1 0 0 1
21 21 56 31 25 94015 4 0.90 2 111 0 0 0 1 0
23 23 29 5 62 90277 1 1.20 1 260 0 0 0 1 0
Date: 31-Oct-2013 21:57:13
Data source: Data!$A$5:$N$5004
Partitioning Method: Randomly chosen
Random Seed: 12345
# training rows: 3000
# validation rows: 2000
Records with negative Experience values (the 52 deleted records)
ID  Age  Experience  Income  ZIP Code  Family  CCAvg  Education  Mortgage  Personal Loan  Securities Account  CD Account  Online  CreditCard
2619 23 -3 55 92704 3 2.40 2 145 0 0 0 1 0
3627 24 -3 28 90089 4 1.00 3 0 0 0 0 0 0
4286 23 -3 149 93555 2 7.20 1 0 0 0 0 1 0
4515 24 -3 41 91768 4 1.00 3 0 0 0 0 1 0
316 24 -2 51 90630 3 0.30 3 0 0 0 0 1 0
452 28 -2 48 94132 2 1.75 3 89 0 0 0 1 0
598 24 -2 125 92835 2 7.20 1 0 0 1 0 0 1
794 24 -2 150 94720 2 2.00 1 0 0 0 0 1 0
890 24 -2 82 91103 2 1.60 3 0 0 0 0 1 1
2467 24 -2 80 94105 2 1.60 3 0 0 0 0 1 0
2718 23 -2 45 95422 4 0.60 2 0 0 0 0 1 1
2877 24 -2 80 91107 2 1.60 3 238 0 0 0 0 0
2963 23 -2 81 91711 2 1.80 2 0 0 0 0 0 0
3131 23 -2 82 92152 2 1.80 2 0 0 1 0 0 1
3797 24 -2 50 94920 3 2.40 2 0 0 1 0 0 0
3888 24 -2 118 92634 2 7.20 1 0 0 1 0 1 0
4117 24 -2 135 90065 2 7.20 1 0 0 0 0 1 0
4412 23 -2 75 90291 2 1.80 2 0 0 0 0 1 1
4482 25 -2 35 95045 4 1.00 3 0 0 0 0 1 0
90 25 -1 113 94303 4 2.30 3 0 0 0 0 0 1
227 24 -1 39 94085 2 1.70 2 0 0 0 0 0 0
525 24 -1 75 93014 4 0.20 1 0 0 0 0 1 0
537 25 -1 43 92173 3 2.40 2 176 0 0 0 1 0
541 25 -1 109 94010 4 2.30 3 314 0 0 0 1 0
577 25 -1 48 92870 3 0.30 3 0 0 0 0 0 1
584 24 -1 38 95045 2 1.70 2 0 0 0 0 1 0
650 25 -1 82 92677 4 2.10 3 0 0 0 0 1 0
671 23 -1 61 92374 4 2.60 1 239 0 0 0 1 0
687 24 -1 38 92612 4 0.60 2 0 0 0 0 1 0
910 23 -1 149 91709 1 6.33 1 305 0 0 0 0 1
1174 24 -1 35 94305 2 1.70 2 0 0 0 0 0 0
1429 25 -1 21 94583 4 0.40 1 90 0 0 0 1 0
1523 25 -1 101 94720 4 2.30 3 256 0 0 0 0 1
1906 25 -1 112 92507 2 2.00 1 241 0 0 0 1 0
2103 25 -1 81 92647 2 1.60 3 0 0 0 0 1 1
2431 23 -1 73 92120 4 2.60 1 0 0 0 0 1 0
2546 25 -1 39 94720 3 2.40 2 0 0 0 0 1 0
2849 24 -1 78 94720 2 1.80 2 0 0 0 0 0 0
2981 25 -1 53 94305 3 2.40 2 0 0 0 0 0 0
3077 29 -1 62 92672 2 1.75 3 0 0 0 0 0 1
3158 23 -1 13 94720 4 1.00 1 84 0 0 0 1 0
3280 26 -1 44 94901 1 2.00 2 0 0 0 0 0 0
3285 25 -1 101 95819 4 2.10 3 0 0 0 0 0 1
3293 25 -1 13 95616 4 0.40 1 0 0 1 0 0 0
3395 25 -1 113 90089 4 2.10 3 0 0 0 0 1 0
3426 23 -1 12 91605 4 1.00 1 90 0 0 0 1 0
3825 23 -1 12 95064 4 1.00 1 0 0 1 0 0 1
3947 25 -1 40 93117 3 2.40 2 0 0 0 0 1 0
4016 25 -1 139 93106 2 2.00 1 0 0 0 0 0 1
4089 29 -1 71 94801 2 1.75 3 0 0 0 0 0 0
4583 25 -1 69 92691 3 0.30 3 0 0 0 0 1 0
4958 29 -1 50 95842 2 1.75 3 0 0 0 0 0 1
Data
Training data used for building the model ['UniversalBank_Logistic
# Records in the training data 2969
Validation data ['UniversalBank_Logistic
# Records in the validation data 1979
Logistic Regression
Set confidence level 95%
Best subset selection: Exhaustive search
#Coeffs  RSS          Cp           Probability  Model (Constant present in all models)
2        3183.687744  219.7645264  0            Constant, Income
3        3100.568359  138.6170197  0            Constant, Income, Education
4        3039.182617  79.21051788  0            Constant, Income, Education, CD Account
5        2990.404297  32.41569901  0.00001378   Constant, Income, Family, Education, CD Account
6        2981.825928  25.83442879  0.00019169   Constant, Income, Family, Education, CD Account, Online
7        2972.419922  18.42524338  0.00426458   Constant, Income, Family, Education, CD Account, Online, CreditCard
8        2964.378662  12.38126278  0.06197376   Constant, Income, Family, Education, Securities Account, CD Account, Online, CreditCard
9        2959.063965  9.06476879   0.35692081   Constant, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard
10       2957.014404  9.01451492   0.9041543    Constant, Experience, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard
11       2957         11.00010586  1            Constant, Experience, Income, Family, CCAvg, Education, Mortgage, Securities Account, CD Account, Online, CreditCard
Logistic Regression
Best subset selection
#Coeffs  RSS          Cp           Probability  Model (Constant present in all models)
2        3212.504639  217.5792542  0            Constant, Income
3        3108.05542   115.0950928  0            Constant, Income, Education
4        3059.695313  68.71881104  0            Constant, Income, Education, CD Account
5        3019.44751   30.45754433  0.00001909   Constant, Income, Family, Education, CD Account
6        3008.855957  21.86244774  0.00062266   Constant, Income, Family, Education, CD Account, CreditCard
7        2999.336426  14.33973217  0.01660405   Constant, Income, Family, Education, CD Account, Online, CreditCard
8        2994.993164  11.99501705  0.05081629   Constant, Income, Family, Education, Securities Account, CD Account, Online, CreditCard
9        2991.976807  10.97765064  0.08504453   Constant, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard
10       2989.000244  10.00009251  1            Constant, Experience, Income, Family, CCAvg, Education, Securities Account, CD Account, Online, CreditCard
The Regression Model
Input variables      Coefficient   Std. Error   p-value      Odds          95% Confidence Interval (lower, upper)
Constant term        -13.981266    0.8205896    0            8.47253E-07   8.16778E-07   8.77729E-07
Experience           0.01231669    0.00861601   0.15285712   1.01239288    0.99544007    1.02963436
Income               0.05771863    0.00359579   0            1.05941689    1.0519768     1.06690955
Family               0.73089772    0.0997033    0            2.07694435    1.70827281    2.52518058
CCAvg                0.13275796    0.05343928   0.01298148   1.14197361    1.02841508    1.26807117
Education            1.72314227    0.15280582   0            5.60210371    4.15224123    7.55822325
Mortgage             0.00008745    0.00073139   0.90482515   1.0000875     0.99865484    1.00152206
Securities Account   -1.11542439   0.39214      0.00444875   0.32777616    0.15198027    0.70691556
CD Account           4.12018013    0.4292928    0            61.5703392    26.54341698   142.8190918
Online               -0.79940081   0.20818347   0.00012309   0.44959828    0.29896376    0.67613083
CreditCard           -0.95952284   0.2626732    0.00025928   0.38307562    0.22892682    0.64102113
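The Odds column and its 95% confidence interval can be recovered from the coefficient and its standard error. The sketch below assumes the usual normal-approximation interval exp(coef ± z · SE) with z = 1.96; the function name `odds_with_ci` is illustrative, and XLMiner may use a slightly more precise z value.

```python
import math

def odds_with_ci(coef, std_err, z=1.96):
    """Return (odds, lower, upper) for one logistic regression coefficient."""
    odds = math.exp(coef)
    lower = math.exp(coef - z * std_err)
    upper = math.exp(coef + z * std_err)
    return odds, lower, upper

# Coefficient 0.73089772 with standard error 0.0997033, as reported in the model
odds, lower, upper = odds_with_ci(0.73089772, 0.0997033)
print(f"odds = {odds:.4f}, 95% CI = ({lower:.4f}, {upper:.4f})")
```

An odds value above 1 means the predictor raises the odds of accepting the personal loan when it increases by one unit, holding the other predictors fixed; a value below 1 lowers them.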
Performance Evaluation
Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.5 (Updating the value here will NOT update value in detailed report)
Classification Confusion Matrix
               Predicted Class
Actual Class   1      0
1              179    107
0              43     2640
Error Report
Class     # Cases   # Errors   % Error
1         286       107        37.41
0         2683      43         1.60
Overall   2969      150        5.05
Validation Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.5
Classification Confusion Matrix
               Predicted Class
Actual Class   1      0
1              130    64
0              30     1755
Error Report
Class     # Cases   # Errors   % Error
1         194       64         32.99
0         1785      30         1.68
Overall   1979      94         4.75
Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.3
Classification Confusion Matrix
               Predicted Class
Actual Class   1      0
1              213    73
0              93     2590
Error Report
Class     # Cases   # Errors   % Error
1         286       73         25.52
0         2683      93         3.47
Overall   2969      166        5.59
Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.2
Classification Confusion Matrix
               Predicted Class
Actual Class   1      0
1              235    51
0              148    2535
Error Report
Class     # Cases   # Errors   % Error
1         286       51         17.83
0         2683      148        5.52
Overall   2969      199        6.70
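Each Error Report is a direct summary of its confusion matrix. As a small sketch, the function below (the name `error_report` and the nested-dict layout are illustrative choices) rebuilds the report for the training data at cutoff 0.5 from the four counts shown in this deck.

```python
def error_report(matrix):
    """matrix[actual][predicted] holds counts; returns per-class and overall errors."""
    report = {}
    total_cases = 0
    total_errors = 0
    for actual, row in matrix.items():
        cases = sum(row.values())
        errors = cases - row[actual]  # everything off the diagonal
        report[actual] = (cases, errors, 100 * errors / cases)
        total_cases += cases
        total_errors += errors
    report["Overall"] = (total_cases, total_errors, 100 * total_errors / total_cases)
    return report

# Training data, cutoff 0.5 (counts from the Classification Confusion Matrix)
training_cutoff_05 = {1: {1: 179, 0: 107}, 0: {1: 43, 0: 2640}}
report = error_report(training_cutoff_05)
for cls, (cases, errors, pct) in report.items():
    print(cls, cases, errors, f"{pct:.2f}")
```

Running the same function on the cutoff 0.3 and 0.2 matrices shows the trade-off visible in the reports: the error for class 1 falls while the error for class 0 and the overall error rise.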
Performance Evaluation
[Figure: Lift chart (validation dataset). x-axis: # cases; y-axis: Cumulative; series: "Cumulative Personal Loan when sorted using predicted values" and "Cumulative Personal Loan using average"]
[Figure: Decile-wise lift chart (validation dataset). x-axis: Deciles (1 to 10); y-axis: Decile mean / Global mean]
[Figure: Lift chart (training dataset). x-axis: # cases; y-axis: Cumulative; series: "Cumulative Personal Loan when sorted using predicted values" and "Cumulative Personal Loan using average"]
[Figure: Decile-wise lift chart (training dataset). x-axis: Deciles (1 to 10); y-axis: Decile mean / Global mean]
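The quantity plotted in a decile-wise lift chart (decile mean / global mean) can be sketched as below. The scores and outcomes are hypothetical toy values, and the function `decile_lift` is an illustration of the general technique rather than XLMiner's exact procedure.

```python
def decile_lift(actual, predicted):
    """Ratio of each decile's mean outcome to the global mean outcome."""
    n = len(actual)
    if n < 10:
        raise ValueError("need at least 10 cases to form deciles")
    global_mean = sum(actual) / n
    if global_mean == 0:
        raise ValueError("no positive cases, lift is undefined")
    # Sort case indices from highest to lowest predicted probability
    order = sorted(range(n), key=lambda i: predicted[i], reverse=True)
    lifts = []
    for d in range(10):
        chunk = order[d * n // 10:(d + 1) * n // 10]
        decile_mean = sum(actual[i] for i in chunk) / len(chunk)
        lifts.append(decile_mean / global_mean)
    return lifts

# Hypothetical example: 20 cases, 3 positives concentrated at the top scores
predicted = [0.95 - 0.04 * i for i in range(20)]
actual = [1, 1, 1, 0] + [0] * 16
lifts = decile_lift(actual, predicted)
print([round(x, 2) for x in lifts])
```

A first-decile value well above 1 means the model ranks actual responders near the top far more often than random selection would.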
II. Building a Naive Bayes Model with the Flight Delay Data
Data Exploration
[Figure: Weather. Categories: 0 delayed, 0 ontime, 1 delayed]
[Figure: Week. Delayed flights by day of week (1 to 7)]
[Figure: Origin. Delayed and ontime counts for BWI, DCA, IAD]
[Figure: Destination. Delayed and ontime counts for EWR, JFK, LGA]
Data Exploration
[Figure: Carrier. Delayed and ontime counts for CO, DH, DL, MQ, OH, RU, UA, US]
Data Exploration
Scheduled departure time
600-700    4%
700-800    5%
800-900    6%
900-1000   3%
1000-1100  3%
1100-1200  1%
1200-1300  5%
1300-1400  5%
1400-1500  15%
1500-1600  9%
1600-1700  8%
1700-1800  15%
1800-1900  3%
1900-2000  9%
2000-2100  2%
2100-2200  8%
Data Processing
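Records with departure times of 10 and 109 fall outside the 600 to 2200 range, so they are treated as outliers and deleted.
The scheduled departure time data are regrouped into 16 time blocks.
The actual departure time is excluded because it cannot be known in advance at prediction time. The variable that is nearly identical for every flight because all flights are on the Washington DC to New York route (mean 211.87, median 214, mode 214, standard deviation 13.31; presumably distance) is also excluded.
The nominal variables tail number and flight number are excluded.
The flight date is excluded because it is less useful than the day of week for future prediction.
The data set is randomly partitioned 60:40 into a training set and a validation set.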
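The regrouping into 16 one-hour time blocks, together with the out-of-range check, could look like the sketch below. The function name `time_block` and the half-open interval convention (600 up to but not including 2200) are assumptions; the slides only give the block labels.

```python
def time_block(crs_dep_time):
    """Map an hhmm scheduled departure time to a one-hour block label.

    Returns None for times outside 600-2200, which are treated as outliers.
    """
    if not 600 <= crs_dep_time < 2200:
        return None
    start = (crs_dep_time // 100) * 100
    return f"{start}-{start + 100}"

print(time_block(1455))  # a mid-afternoon departure
print(time_block(10), time_block(109))  # the two outlier values from the slides

# All block labels produced for the hours 600 through 2100
blocks = [time_block(hour * 100) for hour in range(6, 22)]
print(len(blocks), "blocks:", blocks)
```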
Naïve Bayes
Conditional probabilities
Input Variables       Value      Prob (ontime)   Prob (delayed)
CARRIER               CO         0.036312849     0.06122449
CARRIER               DH         0.231843575     0.306122449
CARRIER               DL         0.188081937     0.118367347
CARRIER               MQ         0.118249534     0.163265306
CARRIER               OH         0.013035382     0.012244898
CARRIER               RU         0.174115456     0.244897959
CARRIER               UA         0.016759777     0.004081633
CARRIER               US         0.22160149      0.089795918
DEST                  EWR        0.273743017     0.387755102
DEST                  JFK        0.176908752     0.187755102
DEST                  LGA        0.549348231     0.424489796
ORIGIN                BWI        0.057728119     0.102040816
ORIGIN                DCA        0.645251397     0.502040816
ORIGIN                IAD        0.297020484     0.395918367
Weather               0          1               0.930612245
Weather               1          0               0.069387755
DAY_WEEK              Mon        0.131284916     0.220408163
DAY_WEEK              Tue        0.14990689      0.130612245
DAY_WEEK              Wed        0.148044693     0.151020408
DAY_WEEK              Thur       0.181564246     0.130612245
DAY_WEEK              Fri        0.170391061     0.159183673
DAY_WEEK              Sat        0.111731844     0.069387755
DAY_WEEK              Sun        0.10707635      0.13877551
Binned_CRS_DEP_TIME   600-700    0.058659218     0.032653061
Binned_CRS_DEP_TIME   700-800    0.055865922     0.053061224
Binned_CRS_DEP_TIME   800-900    0.082867784     0.06122449
Binned_CRS_DEP_TIME   900-1000   0.047486034     0.016326531
Binned_CRS_DEP_TIME   1000-1100  0.044692737     0.032653061
Binned_CRS_DEP_TIME   1100-1200  0.040968343     0.016326531
Binned_CRS_DEP_TIME   1200-1300  0.0716946       0.065306122
Binned_CRS_DEP_TIME   1300-1400  0.083798883     0.048979592
Binned_CRS_DEP_TIME   1400-1500  0.090316574     0.146938776
Binned_CRS_DEP_TIME   1500-1600  0.067970205     0.085714286
Binned_CRS_DEP_TIME   1600-1700  0.081005587     0.07755102
Binned_CRS_DEP_TIME   1700-1800  0.104283054     0.13877551
Binned_CRS_DEP_TIME   1800-1900  0.044692737     0.028571429
Binned_CRS_DEP_TIME   1900-2000  0.047486034     0.089795918
Binned_CRS_DEP_TIME   2000-2100  0.019553073     0.024489796
Binned_CRS_DEP_TIME   2100-2200  0.058659218     0.081632653
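Each conditional probability is a relative frequency within one class: the number of training flights of that class with a given value, divided by the number of training flights of that class. The sketch below illustrates this on a handful of hypothetical flights; the record layout and the function `conditional_probabilities` are assumptions, and no smoothing is applied, which is consistent with the probability of exactly 0 shown for Weather = 1 among ontime flights.

```python
from collections import Counter, defaultdict

def conditional_probabilities(records, feature, target):
    """Estimate P(feature = value | class) by relative frequency."""
    class_counts = Counter(rec[target] for rec in records)
    joint_counts = Counter((rec[target], rec[feature]) for rec in records)
    table = defaultdict(dict)
    for (cls, value), count in joint_counts.items():
        table[cls][value] = count / class_counts[cls]
    return dict(table)

# Hypothetical toy flights (not taken from the flight delay data)
flights = [
    {"CARRIER": "RU", "status": "delayed"},
    {"CARRIER": "RU", "status": "ontime"},
    {"CARRIER": "DH", "status": "ontime"},
    {"CARRIER": "DH", "status": "ontime"},
    {"CARRIER": "US", "status": "ontime"},
    {"CARRIER": "DH", "status": "delayed"},
]

probs = conditional_probabilities(flights, "CARRIER", "status")
print(probs)
```

Values never observed for a class are simply absent from the result, which corresponds to an estimated probability of 0.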
Example: flying RU (Continental Express Airline) on a Wednesday, departing between 15:00 and 16:00 from IAD to LGA, in good weather.
Ontime = 0.81 * 0.174 * 0.148 * 0.068 * 0.297 * 0.549 * 1 ≈ 0.000231
Delay = 0.186 * 0.245 * 0.424 * 0.396 * 0.151 * 0.0857 * 0.931 ≈ 0.0000922
Ontime probability = 0.000231 / (0.000231 + 0.0000922) ≈ 0.715
Approximately 71.5% (above the 50% cutoff value, so the flight is classified as ontime)
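The same calculation can be carried out with the unrounded values from the prior and conditional probability tables. The sketch below is a plain restatement of the naive Bayes product rule; the dictionary keys and the function name `posterior` are illustrative. Because it uses unrounded table values, the result can differ in the third decimal place from a hand calculation with rounded factors.

```python
# Prior class probabilities and the conditional probabilities for the example flight
prior = {"ontime": 0.814253222, "delayed": 0.185746778}
cond = {
    "ontime": {
        "CARRIER=RU": 0.174115456, "DAY_WEEK=Wed": 0.148044693,
        "DEP_TIME=1500-1600": 0.067970205, "ORIGIN=IAD": 0.297020484,
        "DEST=LGA": 0.549348231, "Weather=0": 1.0,
    },
    "delayed": {
        "CARRIER=RU": 0.244897959, "DAY_WEEK=Wed": 0.151020408,
        "DEP_TIME=1500-1600": 0.085714286, "ORIGIN=IAD": 0.395918367,
        "DEST=LGA": 0.424489796, "Weather=0": 0.930612245,
    },
}

def posterior(prior, cond):
    """Multiply prior and conditionals per class, then normalize."""
    score = {}
    for cls, p_cls in prior.items():
        s = p_cls
        for p in cond[cls].values():
            s *= p
        score[cls] = s
    total = sum(score.values())
    return {cls: s / total for cls, s in score.items()}, score

post, score = posterior(prior, cond)
print(f"ontime score  = {score['ontime']:.8f}")
print(f"delayed score = {score['delayed']:.8f}")
print(f"P(ontime | flight) = {post['ontime']:.3f}")
```

The delayed score is on the order of 0.00009, so the ontime posterior comes out near 0.72 and the flight is still classified as ontime at a 50% cutoff.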
Prior class probabilities
According to relative occurrences in training data
Class     Prob.
ontime    0.814253222   <-- Success Class
delayed   0.185746778
Performance Evaluation
Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.5
Classification Confusion Matrix
               Predicted Class
Actual Class   ontime   delayed
ontime         1049     25
delayed        205      40
Error Report
Class     # Cases   # Errors   % Error
ontime    1074      25         2.33
delayed   245       205        83.67
Overall   1319      230        17.44
Validation Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.5
Classification Confusion Matrix
               Predicted Class
Actual Class   ontime   delayed
ontime         685      14
delayed        155      26
Error Report
Class     # Cases   # Errors   % Error
ontime    699       14         2.00
delayed   181       155        85.64
Overall   880       169        19.20
Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.3
Classification Confusion Matrix
               Predicted Class
Actual Class   ontime   delayed
ontime         1074     0
delayed        228      17
Error Report
Class     # Cases   # Errors   % Error
ontime    1074      0          0.00
delayed   245       228        93.06
Overall   1319      228        17.29
Training Data scoring - Summary Report
Cut off Prob.Val. for Success (Updatable): 0.8
Classification Confusion Matrix
               Predicted Class
Actual Class   ontime   delayed
ontime         672      402
delayed        83       162
Error Report
Class     # Cases   # Errors   % Error
ontime    1074      402        37.43
delayed   245       83         33.88
Overall   1319      485        36.77
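The effect of the cutoff can be illustrated with a small sketch. It assumes the cutoff is applied to the predicted probability of the ontime class, which is what the reports suggest, since raising the cutoff from 0.3 to 0.8 moves predictions from ontime toward delayed. The probabilities and outcomes below are hypothetical, and `confusion_at_cutoff` is an illustrative helper, not an XLMiner function.

```python
def confusion_at_cutoff(actual, prob_success, cutoff,
                        success="ontime", failure="delayed"):
    """Count (actual, predicted) pairs when classifying at a given cutoff."""
    counts = {(a, p): 0 for a in (success, failure) for p in (success, failure)}
    for a, prob in zip(actual, prob_success):
        predicted = success if prob >= cutoff else failure
        counts[(a, predicted)] += 1
    return counts

# Hypothetical flights with a predicted probability of being ontime
actual = ["ontime", "ontime", "ontime", "delayed", "delayed", "ontime"]
prob_ontime = [0.95, 0.85, 0.60, 0.75, 0.40, 0.35]

results = {c: confusion_at_cutoff(actual, prob_ontime, c) for c in (0.3, 0.5, 0.8)}
for cutoff, counts in results.items():
    print(cutoff, counts)
```

A higher cutoff catches more delayed flights at the price of misclassifying more ontime flights, the same pattern as in the cutoff 0.8 report.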
Performance Evaluation
[Figure: Lift chart (training dataset). x-axis: # cases; y-axis: Cumulative; series: "Cumulative Flight Status when sorted using predicted values" and "Cumulative Flight Status using average"]
[Figure: Decile-wise lift chart (training dataset). x-axis: Deciles (1 to 10); y-axis: Decile mean / Global mean]
[Figure: Lift chart (validation dataset). x-axis: # cases; y-axis: Cumulative; series: "Cumulative Flight Status when sorted using predicted values" and "Cumulative Flight Status using average"]
[Figure: Decile-wise lift chart (validation dataset). x-axis: Deciles (1 to 10); y-axis: Decile mean / Global mean]