Predictive Modeling: A practical guide to data science and model building in under 60 minutes
Prashant Mudgal
Introduction
Predictive modeling and data science are said to be among the most attractive subjects, and the related jobs among the hottest of the twenty-first century. The same was quoted in a Harvard Business Review article a few months ago. https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
There are thousands of articles available online, and as many books, that teach data science and cover the theory of related topics such as linear algebra, probability, optimization, machine learning and calculus. This brief work is aimed in the same direction, with a focus on implementation on a fairly sizable dataset. It covers cleaning the data, visualization, EDA, feature scaling, feature normalization, k-nearest neighbors, logistic regression, random forests and cross validation, without delving too deep into any of them but giving a new learner a start.
Problem Statement and Data
The name of the bank is Banco de Portugal Website https://www.bportugal.pt/en-US/Pages/inicio.aspx
Problem statement: Predict, using mathematical methods, whether a customer subscribed to a term deposit.
Data has been taken from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
Citation: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, in press, http://dx.doi.org/10.1016/j.dss.2014.03.001
Available at: [pdf] http://dx.doi.org/10.1016/j.dss.2014.03.001
[bib] http://www3.dsi.uminho.pt/pcortez/bib/2014-dss.txt
1. Title: Bank Marketing (with social/economic context)
2. Sources : Sérgio Moro (ISCTE-IUL), Paulo Cortez (Univ. Minho) and Paulo Rita (ISCTE-IUL) @ 2014
3. Past Usage: The full dataset (bank-additional-full.csv) was described and analyzed in: S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems (2014), doi:10.1016/j.dss.2014.03.001.
4. Relevant Information: This dataset is based on the "Bank Marketing" UCI dataset. The data is enriched by the addition of five new social and economic features/attributes (nationwide indicators from a country of roughly 10 million people), published by the Banco de Portugal and publicly available at: https://www.bportugal.pt/estatisticasweb. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').
Language and Packages
Given the rise of Python and its ease of use, the models have been built in Python. One can replicate the same in R or SAS.
Packages and libraries: pandas, NumPy, scikit-learn, matplotlib
Breaking Predictive analytics down
It is an industry-wide standard to spend approximately 80% of the time on EDA and data preparation.
This is the split of time spent for only the first model that is built:
Descriptive Analysis: 40%
Data Treatment: 40%
Model Building: 15%
Performance Estimation: 5%
Descriptive Analysis
Descriptive analysis of data deals with understanding the data we are using. It includes:
Identifying the predictor and the target variables. Dividing the dataset into training and test sets; usually 25% of the dataset is kept as test and the model is built on the remaining 75% of the data.
Univariate analysis for checking the spread of the variables. For numerical variables one uses measures of central tendency (mean, median, mode), measures of dispersion (IQR, range, variance, standard deviation) or visualization methods. For categorical variables one can use frequency distributions and bar charts.
Bivariate analysis to check the association between the various variables. For continuous variables one can use scatter plots and then measure correlation coefficients.
-1: perfect negative linear correlation
+1: perfect positive linear correlation
0: no correlation
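The correlation coefficients above can be computed directly with pandas. A minimal sketch, using a small hypothetical frame (the column names are illustrative, not taken from the bank dataset):

```python
import pandas as pd

# Hypothetical numeric columns: 'balance' rises linearly with 'age',
# while 'campaign' falls linearly with it.
df = pd.DataFrame({
    "age": [30, 35, 40, 45, 50],
    "balance": [1000, 1500, 2000, 2500, 3000],
    "campaign": [5, 4, 3, 2, 1],
})

corr = df.corr(method="pearson")
print(corr.loc["age", "balance"])   # +1.0: perfect positive linear correlation
print(corr.loc["age", "campaign"])  # -1.0: perfect negative linear correlation
```

In practice one would call `df.corr()` on the full numeric portion of the data frame and inspect the matrix, or visualize it as a heatmap.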
For categorical vs. continuous variables, use the Z-test or t-test (the t-test for small samples), or ANOVA (analysis of variance), to check whether two samples from the dataset are statistically different.
For categorical vs. categorical variables, the chi-squared test is the prime test. It is a probability-based approach: a p-value close to 1 suggests the variables are independent, while a small p-value suggests they are associated. The chi-squared statistic is also used to measure goodness of fit, and strong association between categorical predictors can flag multicollinearity.
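A chi-squared test of independence can be run with scipy. A minimal sketch on a hypothetical contingency table (the job categories and counts are invented for illustration):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table: job type vs. subscription outcome
table = pd.DataFrame(
    {"yes": [30, 10], "no": [70, 90]},
    index=["management", "blue-collar"],
)

chi2, p, dof, expected = chi2_contingency(table)
# A small p-value suggests the two variables are associated;
# a p-value close to 1 suggests they are independent.
print(chi2, p, dof)
```

Here the subscription rates differ sharply between the two groups, so the p-value comes out small and we would reject independence.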
Data Treatment
Data treatment includes:
Outlier treatment - log transformation
Missing value treatment - imputing using mean, median or mode
Feature Scaling - limiting the range of variables
Normalization of features
Label Encoding for categorical variables as models can’t work on String variables
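As an example of the outlier treatment mentioned above, a log transformation compresses extreme values so they no longer dominate the scale. A minimal sketch on hypothetical skewed values:

```python
import numpy as np

# Hypothetical, heavily skewed account balances; the last value is an outlier
balance = np.array([100, 200, 300, 50000])

# log1p computes log(1 + x), which is safe for zero values
log_balance = np.log1p(balance)
print(log_balance)  # the outlier is pulled much closer to the rest
```

After the transform the outlier is within an order of magnitude of the other values instead of two orders above them.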
Model Building
Most of the problems that are solved using machine learning techniques are classification problems.
For numerical target variables - Linear Regression
For categorical target variables - Logistic Regression
Tree-based methods - Random Forest, Gradient Boosting
Performance Estimation and Improvements
Performance of the model can be checked using various methods:
ROC-AUC curve (Receiver Operating Characteristic)
Checking accuracy score
Confusion Matrix
Measuring Root mean squared error
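The classification metrics above are all available in scikit-learn. A minimal sketch on hypothetical true labels and predictions (the numbers are invented, not model output):

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical labels: 1 = subscribed to the term deposit, 0 = did not
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]               # hard class predictions
y_prob = [0.2, 0.6, 0.8, 0.9, 0.1, 0.4]  # predicted probability of class 1

acc = accuracy_score(y_true, y_pred)   # fraction of correct predictions
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted
auc = roc_auc_score(y_true, y_prob)    # area under the ROC curve

print(acc)
print(cm)
print(auc)
```

Note that ROC-AUC is computed from predicted probabilities, not from the hard class labels.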
To reduce model error, one can use k-fold cross validation. If there are a large number of predictors, one can select the best features using SelectKBest; this process is called dimensionality reduction. Principal Component Analysis (PCA) is another powerful way to reduce the number of dimensions and mitigate multicollinearity.
iPython Notebook https://github.com/Prashantmdgl9/Predictive-Analytics
Import the libraries and read the data. One should take care with the delimiter: though the data is in CSV format, it is semicolon-delimited. Check the number of rows and decide how much to keep for training and test. For now, don't divide the data, as any cleansing activity that needs to be done should be done on the complete data frame.
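A minimal sketch of the read step. A tiny in-memory sample stands in for the real file here (the rows are invented), but the point is the same: pass `sep=";"` or the whole row lands in one column.

```python
import io

import pandas as pd

# In practice: df = pd.read_csv("bank-additional-full.csv", sep=";")
# A small hypothetical sample shows why the separator matters:
sample = 'age;job;y\n30;"admin.";"no"\n45;"technician";"yes"\n'

df = pd.read_csv(io.StringIO(sample), sep=";")
print(df.shape)          # (rows, columns) — here (2, 3)
print(list(df.columns))  # ['age', 'job', 'y']
```

Without `sep=";"`, pandas would parse each line as a single field and the column names would come out fused together.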
Look at the data in detail; use the head, describe and columns functions in the pandas library to take a closer look.
The data doesn’t have an ID column; let’s add one and check the various data types.
The data frame has int and string type variables. Let’s check whether there are any columns with missing data and identify the ID and target variables. Also, separate the numeric and the categorical columns.
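These steps can be sketched with pandas built-ins. The small frame below is a hypothetical stand-in mirroring the mix of numeric and string columns:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bank data frame
df = pd.DataFrame({
    "age": [30, np.nan, 45],
    "job": ["admin.", "services", None],
    "y": ["no", "yes", "no"],
})
df["ID"] = range(1, len(df) + 1)  # add an ID column

print(df.isnull().sum())  # missing values per column

# Separate the numeric and the categorical columns by dtype
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()
print(numeric_cols)      # ['age', 'ID']
print(categorical_cols)  # ['job', 'y']
```

The `select_dtypes` split is what lets later steps impute and scale the numeric columns while encoding the categorical ones.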
For the purpose of demonstration, we will deal with missing data: impute numeric values with the mean and categorical values with a -9999 sentinel.
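A minimal imputation sketch on a hypothetical frame with one missing value per column type:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [30.0, np.nan, 40.0],
    "job": ["admin.", None, "services"],
})

df["age"] = df["age"].fillna(df["age"].mean())  # mean-impute the numeric column
df["job"] = df["job"].fillna("-9999")           # sentinel for the categorical column

print(df["age"].tolist())  # [30.0, 35.0, 40.0]
print(df["job"].tolist())  # ['admin.', '-9999', 'services']
```

Median or mode can be swapped in via `df["age"].median()` or `df["age"].mode()[0]` where the distribution is skewed.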
Let’s scale the features. Feature scaling is done to limit the range of the variables so they can be compared, and is applied to continuous variables. Before doing so, let’s plot histograms of the numeric variables and check the range and distribution.
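Pandas can draw all the histograms in one call. A sketch on hypothetical numeric columns (the `Agg` backend is used so this also runs without a display):

```python
import matplotlib

matplotlib.use("Agg")  # headless backend; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric columns with very different ranges
df = pd.DataFrame({
    "age": [25, 30, 35, 40, 60],
    "duration": [100, 900, 300, 50, 1200],
})

axes = df.hist(figsize=(8, 4), bins=5)  # one subplot per numeric column
plt.tight_layout()
plt.savefig("histograms.png")
```

The side-by-side plots make the mismatched ranges obvious, which motivates the scaling step that follows.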
We see that many variables have entirely different ranges, so scaling should help. Before proceeding, we should encode our categorical variables, as they are string objects and we can’t build models on strings without converting them.
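A minimal encoding sketch with scikit-learn's LabelEncoder, on hypothetical columns resembling the dataset's 'marital' and target 'y' fields:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns
df = pd.DataFrame({
    "marital": ["married", "single", "divorced", "married"],
    "y": ["no", "yes", "no", "no"],
})

for col in ["marital", "y"]:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])  # classes are assigned codes in sorted order

print(df["marital"].tolist())  # [1, 2, 0, 1] — divorced=0, married=1, single=2
print(df["y"].tolist())        # [0, 1, 0, 0] — no=0, yes=1
```

In the notebook this loop would run over every column in the categorical list built earlier.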
Let’s scale the data frame using the MinMax scaler and fit kNN. Here we also measure the accuracy of the model.
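A minimal sketch of that step, using a synthetic stand-in for the encoded bank frame (synthetic data, not the real dataset; the parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the encoded, imputed data frame
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)  # fit on training data only
X_test_s = scaler.transform(X_test)        # reuse the training-set scaling

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
acc = knn.score(X_test_s, y_test)
print(acc)
```

Fitting the scaler on the training set alone and only transforming the test set avoids leaking test-set statistics into the model.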
We can normalize the data using the scale function in the scikit-learn library and perform a logistic regression on the normalized data.
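A sketch of the same flow with `scale` (zero mean, unit variance per feature) and logistic regression, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_norm = scale(X)  # standardize each feature: zero mean, unit variance

X_train, X_test, y_train, y_test = train_test_split(
    X_norm, y, test_size=0.25, random_state=0
)

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
acc = logreg.score(X_test, y_test)
print(acc)
```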
We can proceed with other machine learning algorithms; random forest would be the algorithm of our choice because of its strong performance.
To reduce the bias-variance error, we should use k-fold cross validation, which helps alleviate overfitting to any single train/test split.
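A minimal sketch of random forest with 10-fold cross validation, on the same kind of synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

rf = RandomForestClassifier(n_estimators=100, random_state=1)
scores = cross_val_score(rf, X, y, cv=10)  # 10-fold cross validation
print(scores.mean(), scores.std())         # average accuracy and its spread
```

The mean of the fold scores is a more stable estimate of generalization accuracy than a single hold-out score, and the standard deviation shows how much it varies across folds.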
Feature selection is one of the ways to reduce the number of predictors and include only those which explain most of the variance in the model's predictions. It can be achieved using Principal Component Analysis or the SelectKBest feature of scikit-learn.
Let’s plot the p-values of the features and form short list of best features.
Let’s form our new list from the best-scoring features (lowest p-values), fit a random forest with cross validation on the training set with k = 10, and then evaluate on the test data.
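The last two steps can be sketched together: score the features, keep the top k, then cross-validate a random forest on the reduced training set and evaluate on the reduced test set. Synthetic stand-in data again; `k=6` is an illustrative choice:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=2
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2
)

# Score every feature against the target and keep the 6 best
selector = SelectKBest(score_func=f_classif, k=6)
X_train_k = selector.fit_transform(X_train, y_train)
X_test_k = selector.transform(X_test)
print(selector.pvalues_)  # one p-value per original feature; lower is better

rf = RandomForestClassifier(n_estimators=100, random_state=2)
scores = cross_val_score(rf, X_train_k, y_train, cv=10)  # k = 10 folds
rf.fit(X_train_k, y_train)
acc = rf.score(X_test_k, y_test)
print(scores.mean(), acc)
```

Note that SelectKBest keeps the features with the highest scores, which correspond to the lowest p-values.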
Conclusion
The steps above conclude our model building exercise. We started with data exploration and data preparation, and went on to use methods such as kNN, cross validation and random forests to arrive at the final model and results. Depending on the type of dataset, one has to add or remove a few steps, but the gist remains the same - explore, treat, build and improve. The steps above can be used to build a starting model on any type of dataset, and that should give decent accuracy.
* If you want to download the python notebook of the project then you can visit
https://github.com/Prashantmdgl9/Predictive-Analytics