Santander Product Recommendation

Third project: financial product recommendation using customer data from the Spanish bank Santander


Santander Product Recommendation
2016.12.13
Korean Wave

Contents
1. Project Objective
2. Data Introduction
3. Analysis Environment
4. SEMMA Process


1.

Project Objective

The project is based on the Kaggle competition Santander Product Recommendation, hosted by Santander.

Data: 2015.01 - 2016.05
Prediction: 2016.06


Introducing Our Team

Kaggle team name: Korean Wave. Member roles listed on the slide include EDA.


2.

Data Introduction

Santander customer data
January 2015 ~ May 2016: Training set
June 2016: Test set
Size: 2 GB, 13,647,309 rows, 48 columns (24 customer features / 24 products)

Feature                  Description
fecha_dato               date
ncodpers                 cust_id
ind_empleado             emp_index
pais_residencia          cust_country
sexo                     sex
age                      age
fecha_alta               cust_firstdate
ind_nuevo                new_cust (registered within the last 6 months)
antiguedad               cust_seni
indrel                   cust_pri
ult_fec_cli_1t           cust_pri_date
indrel_1mes              cust_type
tiprel_1mes              cust_relation_type
indresi                  residence_index
indext                   foreigner_index
conyuemp                 spouse_index
canal_entrada            channel
indfall                  deceased_index
tipodom                  tipodom
cod_prov                 location_code
nomprov                  location_name
ind_actividad_cliente    activity_index
renta                    income
segmento                 segment

Feature              Description
ind_ahor_fin_ult1    Saving Account
ind_aval_fin_ult1    Guarantees
ind_cco_fin_ult1     Current Accounts
ind_cder_fin_ult1    Derivada Account
ind_cno_fin_ult1     Payroll Account
ind_ctju_fin_ult1    Junior Account
ind_ctma_fin_ult1    Más particular Account
ind_ctop_fin_ult1    particular Account
ind_ctpp_fin_ult1    particular Plus Account
ind_deco_fin_ult1    Short-term deposits
ind_deme_fin_ult1    Medium-term deposits
ind_dela_fin_ult1    Long-term deposits
ind_ecue_fin_ult1    e-account
ind_fond_fin_ult1    Funds
ind_hip_fin_ult1     Mortgage
ind_plan_fin_ult1    Pensions
ind_pres_fin_ult1    Loans
ind_reca_fin_ult1    Taxes
ind_tjcr_fin_ult1    Credit Card
ind_valo_fin_ult1    Securities
ind_viv_fin_ult1     Home Account
ind_nomina_ult1      Payroll
ind_nom_pens_ult1    Pensions
ind_recibo_ult1      Direct Debit


3.

Analysis Environment

Ubuntu 16.04 LTS
Amazon EC2
R 3.3.1
RStudio Server

vCPU: 4
Memory: 30.5 GB (RAM)
SSD: 80 GB

The 2 GB data set is difficult to process in R on a desktop or laptop, so the analysis runs on the server above.


4. SEMMA Process

4-1. Sampling

Sampling: Sampling Data and Partitioning Data

Raw Data Set (nrow=13,647,309): 200,000 of the 956,645 customers are sampled.

Sampled Data (nrow=2,852,306)
Training Data: 2015-01 ~ 2016-04 (nrow=2,657,643)
Test Data: 2016-05 (nrow=194,663)

Data from January 2015 to April 2016 is used as Training Data, with June 2016 as the final prediction target.
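The sampling and partitioning step can be sketched as follows. This is a minimal illustration in Python, not the team's R code: it assumes that sampling was done per customer id (`ncodpers`) and that the split uses the month in `fecha_dato`. The function names and the toy rows are hypothetical.

```python
import random

def sample_customers(rows, n_customers, seed=0):
    """Keep every row that belongs to a random sample of customer ids."""
    ids = sorted({r["ncodpers"] for r in rows})
    rng = random.Random(seed)
    keep = set(rng.sample(ids, min(n_customers, len(ids))))
    return [r for r in rows if r["ncodpers"] in keep]

def partition_by_month(rows, test_month="2016-05"):
    """Months before test_month form the training set; test_month is the test set."""
    train = [r for r in rows if r["fecha_dato"][:7] < test_month]
    test = [r for r in rows if r["fecha_dato"][:7] == test_month]
    return train, test

# Toy data: 5 hypothetical customers, each observed in three months.
rows = [
    {"ncodpers": cid, "fecha_dato": month + "-28"}
    for cid in range(1, 6)
    for month in ("2016-03", "2016-04", "2016-05")
]

sampled = sample_customers(rows, n_customers=3)
train, test = partition_by_month(sampled)
print(len(sampled), len(train), len(test))
```

Sampling whole customers rather than individual rows keeps each customer's monthly history intact, which matters when earlier months are used to predict a later one.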

Sampling: Test Data

Test Data: 194,663 rows

Product             0          1
Saving Account      194,648    15
Guarantees          194,659    4
Current Accounts    77,290     117,373
Payroll Account     179,756    14,907
Junior Account      193,045    1,618
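The counts above show how unbalanced most product indicators are. A small Python sketch makes the positive rate per product explicit. It assumes the first count in each row is the number of 0 values and the second the number of 1 values, which is an interpretation of the slide rather than something it states.

```python
# (count of 0, count of 1) per product, as read from the Test Data slide.
# Which column is 0 and which is 1 is an assumption.
counts = {
    "Saving Account": (194648, 15),
    "Guarantees": (194659, 4),
    "Current Accounts": (77290, 117373),
    "Payroll Account": (179756, 14907),
    "Junior Account": (193045, 1618),
}
TOTAL_ROWS = 194663

positive_rate = {}
for product, (zeros, ones) in counts.items():
    # Each product row should account for every Test Data row.
    assert zeros + ones == TOTAL_ROWS
    positive_rate[product] = ones / TOTAL_ROWS

# Print products from rarest to most common.
for product, rate in sorted(positive_rate.items(), key=lambda kv: kv[1]):
    print(f"{product:18s} {rate:.5%}")
```

Under that reading, only Current Accounts is held by a majority of customers; the other four products are rare events.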

Test Data(2016.05) 4 Test Data 24 , 7 Sampling

4-2. Exploring

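Exploring

[Figure: customer counts by age, 20 to 100]
[Figure: customer counts by sex: Female, Male, Unknown]
age, sex: the 18~30 and 40~50 age ranges stand out

[Figure: customer counts by segment: VIP, Individuals, college graduated, unknown]
[Figure: customer counts by seniority: ~50, 51~100, 101~150, 151~200, 200~]
seniority: the ~50 range stands out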

Exploring (Segment)

[Figure panels: Fund, Long Term Deposit, Junior Account, Current Account; x-axis: Segment (VIP, Individuals, college graduated, unknown)]
Long-term deposit, Fund: VIP

Exploring (Seniority)

[Figure panels: Guarantees, Particular Account, Current Account, Mas particular Account; x-axis: seniority (~50, 51~100, 101~150, 151~200, 200~)]
Current Account: ~50
Mas particular Account: 51~100
Particular Account: 151~200
Guarantees: 200~

Exploring (age)

[Figure panels: Junior Account, Mortgage, Saving Account, Guarantees; x-axis: age (~25, 26~35, 36~45, 46~55, 56~65, 66~)]
Junior Account: ~25
Saving Account, Mortgage: 36~45
Guarantees: 40

4-3. Modify

Data Cleansing

Missing values (NULL) in the data set are replaced with the Average, the Median, the most frequent value, or UNKNOWN.

01 age
02 new customer index
03 Seniority
04 Cust_pri
05 Joining Date
06 Gross Income
07 Etc.

NULL handling uses: new customer (within 6 months), Minimum, First/Primary, Median (41), UNKNOWN.
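The replacement of NULL values can be sketched in Python as below. This is an illustration only: which rule applies to which column is an assumption here (median for a numeric column, UNKNOWN for a categorical one), and the `impute` helper and the toy rows are hypothetical.

```python
from statistics import median

def impute(rows, numeric_cols, categorical_cols):
    """Return copies of rows with None replaced by the column median or 'UNKNOWN'."""
    filled = [dict(r) for r in rows]
    for col in numeric_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        fill_value = median(observed)
        for r in filled:
            if r[col] is None:
                r[col] = fill_value
    for col in categorical_cols:
        for r in filled:
            if r[col] is None:
                r[col] = "UNKNOWN"
    return filled

# Toy rows using two column names from the feature table.
rows = [
    {"renta": 50000.0, "segmento": "VIP"},
    {"renta": None, "segmento": None},
    {"renta": 90000.0, "segmento": "Individuals"},
    {"renta": 70000.0, "segmento": None},
]
clean = impute(rows, numeric_cols=["renta"], categorical_cols=["segmento"])
print(clean[1])
```

The median is less sensitive than the mean to a few very large incomes, which is one common reason to prefer it for a skewed variable such as gross income.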


Feature Engineering:

(Age, Gross income, Seniority)


Feature Engineering:

Level => Dummy Variable

Age: 6 level
Gross_income: 4 level
Seniority: 5 level
Province name (nomprov): 5 level
Channel: 5 level
Residence (pais_residencia): 9 level

Date: 17
Province name (nomprov): 53
Residence (pais_residencia): 93
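Turning a continuous variable into levels and then into dummy variables can be sketched as follows. The sketch is in Python and assumes the six age levels are the bins shown on the Exploring slide (~25, 26~35, 36~45, 46~55, 56~65, 66~); the deck does not confirm that the same cut points were used for modeling. The `age_` prefix is a hypothetical naming choice.

```python
from bisect import bisect_left

# Upper bounds of the first five bins; anything above 65 falls in the last bin.
AGE_UPPER_BOUNDS = [25, 35, 45, 55, 65]
AGE_LABELS = ["~25", "26~35", "36~45", "46~55", "56~65", "66~"]

def age_level(age):
    """Map an age to one of the six level labels."""
    return AGE_LABELS[bisect_left(AGE_UPPER_BOUNDS, age)]

def dummy_encode(level, labels, prefix="age_"):
    """One 0/1 column per level, with exactly one column set to 1."""
    return {prefix + label: int(label == level) for label in labels}

level = age_level(40)
dummies = dummy_encode(level, AGE_LABELS)
print(level, dummies)
```

Reducing a variable such as Province name from 53 levels to 5 before this step keeps the number of dummy columns manageable.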


Feature Engineering:

Clustering => 6 Level


Feature Engineering:

Association rules between products

Rule                               Support     Confidence  Lift
[1] Mortgage => Direct Debit       0.0061232   0.8603142   5.960670
[2] Payroll => Pensions            0.0625038   1.0000000   14.813015
[3] Pensions => Payroll            0.0625038   0.9258697   14.813015
[4] Payroll => Payroll Account     0.0592032   0.9471936   10.451423
[5] Pensions => Payroll Account    0.0638628   0.9460006   10.438259
. . .

1  Direct Debit: Credit Card, Pensions
2  Payroll: Pensions
3  Payroll Account: Payroll
4  Pensions: Payroll
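Support, confidence and lift for a single rule can be computed directly from their definitions. The Python sketch below uses a handful of made-up baskets, not the Santander data, and the `rule_metrics` helper is hypothetical.

```python
def rule_metrics(baskets, lhs, rhs):
    """Support, confidence and lift of the rule lhs => rhs over a list of sets."""
    n = len(baskets)
    n_lhs = sum(1 for b in baskets if lhs in b)
    n_rhs = sum(1 for b in baskets if rhs in b)
    n_both = sum(1 for b in baskets if lhs in b and rhs in b)
    support = n_both / n              # share of baskets holding both products
    confidence = n_both / n_lhs       # share of lhs holders that also hold rhs
    lift = confidence / (n_rhs / n)   # confidence relative to the base rate of rhs
    return support, confidence, lift

# Made-up product baskets, one set per customer.
baskets = [
    {"Payroll", "Pensions", "Payroll Account"},
    {"Payroll", "Pensions"},
    {"Pensions"},
    {"Current Accounts"},
    {"Current Accounts", "Direct Debit"},
]

forward = rule_metrics(baskets, "Payroll", "Pensions")
reverse = rule_metrics(baskets, "Pensions", "Payroll")
print(forward)
print(reverse)
```

Support and lift are symmetric in the two products while confidence is not, which matches rules [2] and [3] in the table: both share the same support and lift but differ in confidence.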


4-4. Modeling


Feature Choice: Near Zero Variance

Variables whose variance is close to 0: pais_residencia, ult_fec_cli_1t (Primary), conyuemp, indfall

Variables with Near Zero Variance (variance close to 0) are excluded from Modeling.
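A near zero variance check can be sketched as below. The deck does not say which tool was used; this Python version follows the two criteria and default cutoffs of caret's `nearZeroVar` in R (frequency ratio above 95/5 and fewer than 10 percent unique values), which is an assumption about the team's approach. The toy columns are made up.

```python
from collections import Counter

def near_zero_variance(values, freq_cut=95 / 5, unique_cut=10.0):
    """True if one value dominates the column and there are few distinct values."""
    counts = Counter(values).most_common()
    if len(counts) == 1:
        return True  # a constant column has zero variance
    # Ratio of the most common value's count to the second most common one.
    freq_ratio = counts[0][1] / counts[1][1]
    # Distinct values as a percentage of all observations.
    percent_unique = 100.0 * len(counts) / len(values)
    return freq_ratio > freq_cut and percent_unique < unique_cut

# Made-up columns: a deceased flag that is almost always "N", and a balanced sex column.
indfall = ["N"] * 999 + ["S"]
sexo = ["H"] * 60 + ["V"] * 40

print(near_zero_variance(indfall), near_zero_variance(sexo))
```

A column that is nearly constant carries almost no information for separating customers, so dropping it reduces model size at little cost.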


Feature Choice:

Province name (nomprov): 5 level
Channel: 5 level
Residence (pais_residencia): 9 level

Date: 17
Province name (nomprov): 53

Clustering => 6 Level

: 20

Modeling


Modeling Process

Modeling: Training => Model => Prediction (Test set) => Evaluation => Model

Models: SVM, Naive Bayes, Random Forest, XGBoost
SMOTE
Ensemble Model (Stacking Model): XGBoost, Random Forest
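SMOTE creates synthetic minority-class rows by interpolating between a minority point and one of its nearest minority neighbours. The following is a simplified, stdlib-only Python sketch of that idea, not the implementation the team used; the `smote` function, its parameters and the toy points are hypothetical.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base (excluding base), by squared distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Made-up minority-class points with two numeric features.
minority = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (3.0, 3.0)]
synthetic = smote(minority, n_new=6)
print(synthetic)
```

Because each synthetic point lies on a segment between two real minority points, it stays inside the region the minority class already occupies. Oversampling of this kind is normally applied to the training data only, so that evaluation on the test set still reflects the real class balance.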


Model Description

SVM, XGBoost, Naive Bayes


Model Choice

Criterion: Recall
Models compared: SVM, XGBoost, Naive Bayes

Recall: $\mathrm{Recall} = \dfrac{TP}{FN + TP}$
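Recall measures the share of actual positives that a model finds. The Python sketch below computes it from labels and contrasts it with accuracy on made-up, heavily unbalanced labels. That class imbalance is the reason for choosing Recall is an assumption; the deck does not state it.

```python
def recall(y_true, y_pred):
    """TP / (FN + TP); returns 0.0 when there are no actual positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (fn + tp) if (fn + tp) else 0.0

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Made-up labels: 15 positives among 1,000 customers.
y_true = [1] * 15 + [0] * 985
all_zero = [0] * 1000            # a model that never predicts the product
caught_ten = [1] * 10 + [0] * 990  # a model that finds 10 of the 15 positives

print(accuracy(y_true, all_zero), recall(y_true, all_zero))
print(accuracy(y_true, caught_ten), recall(y_true, caught_ten))
```

The model that never predicts a purchase still scores high accuracy but has zero recall, so accuracy alone would hide that it is useless for recommending a rare product.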