
Page 1: Predictive modeling

Predictive Modeling: A practical guide to data science and model building in under 60 minutes

Prashant Mudgal

Page 2: Predictive modeling

Introduction

Predictive modeling and data science are said to be among the most attractive subjects, and the related jobs among the hottest of the twenty-first century. The same was noted in an article in Harvard Business Review: https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

There are thousands of articles available online, and as many books, that teach data science and cover the theory of related topics such as linear algebra, probability, optimization, machine learning and calculus. This brief work is aimed in the same direction, with a focus on implementation on a fairly sizable dataset. It touches on cleaning the data, visualization, EDA, feature scaling, feature normalization, k-nearest neighbors, logistic regression, random forests and cross-validation, without delving too deep into any of them, giving a new learner a start.

Page 3: Predictive modeling

Problem Statement and Data

The bank is Banco de Portugal. Website: https://www.bportugal.pt/en-US/Pages/inicio.aspx

Problem statement: predict, using mathematical methods, whether a customer subscribed to a term deposit.

Data has been taken from https://archive.ics.uci.edu/ml/datasets/Bank+Marketing

Citation: [Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, in press, http://dx.doi.org/10.1016/j.dss.2014.03.001

Available at: [pdf] http://dx.doi.org/10.1016/j.dss.2014.03.001

[bib] http://www3.dsi.uminho.pt/pcortez/bib/2014-dss.txt

1. Title: Bank Marketing (with social/economic context)

2. Sources : Sérgio Moro (ISCTE-IUL), Paulo Cortez (Univ. Minho) and Paulo Rita (ISCTE-IUL) @ 2014

3. Past Usage: The full dataset (bank-additional-full.csv) was described and analyzed in: S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems (2014), doi:10.1016/j.dss.2014.03.001.

4. Relevant Information: This dataset is based on the "Bank Marketing" UCI dataset, enriched with five new social and economic features/attributes (nation-wide indicators from a ~10M population country), published by the Banco de Portugal and publicly available at: https://www.bportugal.pt/estatisticasweb. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

Page 4: Predictive modeling

Language and Packages

Given the rise of Python and its ease of use, the models have been built in Python. One can replicate the same in R or SAS.

Packages and libraries: pandas, NumPy, scikit-learn, matplotlib

Breaking Predictive analytics down

It is an industry-wide standard to spend approximately 80% of the time on EDA and data preparation.

This is the split of time spent only for the first model that is built.

Descriptive Analysis: 40%

Data Treatment: 40%

Model Building: 15%

Performance Estimation: 5%

Page 5: Predictive modeling

Descriptive Analysis

Descriptive analysis of data deals with understanding the data we are using. It includes:

Identifying the predictor and the target variables. Dividing the dataset into training and test sets; usually 25% of the dataset is kept as the test set and the model is built on the remaining 75%.

Univariate analysis for checking the spread of the variables. For numerical variables one uses measures of central tendency (mean, median, mode), measures of dispersion (IQR, range, variance, standard deviation), or visualization methods. For categorical variables one can use frequency distributions and bar charts.

Bivariate analysis to check the association between the various variables. For continuous variables one can use scatter plots and then measure correlation coefficients.

-1: perfect negative linear correlation

+1: perfect positive linear correlation

0: no correlation
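As a quick illustration, pandas can compute this correlation coefficient directly; the toy columns below are hypothetical stand-ins for numeric fields of the bank data:

```python
import pandas as pd

# Toy continuous data standing in for two numeric columns of the bank dataset
df = pd.DataFrame({"age": [25, 32, 47, 51, 62],
                   "balance": [1100, 1450, 2300, 2600, 3100]})

# Pearson correlation coefficient: ranges from -1 to +1
r = df["age"].corr(df["balance"])
print(round(r, 3))
```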

For a categorical and a continuous variable: the Z-test and t-test (the t-test for small samples), or ANOVA (analysis of variance), for checking whether two samples from the dataset are statistically different.

For two categorical variables, the chi-squared test is the prime test. It is a probability-based approach: a high p-value is consistent with the variables being independent, while a low p-value indicates association. The chi-squared statistic is also used to measure goodness of fit.
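A minimal sketch using SciPy's chi2_contingency on a hypothetical contingency table (the counts are made up for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Toy contingency table: a categorical feature (rows) vs. subscribed yes/no (columns)
table = np.array([[90, 10],
                  [40, 60]])

# chi2_contingency returns the statistic, p-value, degrees of freedom
# and the expected counts under independence
chi2, p, dof, expected = chi2_contingency(table)

# A small p-value indicates the two categorical variables are associated
print(p < 0.05)
```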

Data Treatment

Data treatment includes:

Outlier treatment - log transformation

Missing value treatment - imputing using mean, median or mode

Feature Scaling - Limiting range of variables

Normalization of features

Label encoding for categorical variables, as models can’t work directly on string variables
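As a small illustration of outlier treatment, a log transformation (here np.log1p, applied to made-up skewed values) compresses extreme values:

```python
import numpy as np
import pandas as pd

# Skewed toy column with one extreme outlier
s = pd.Series([100, 120, 90, 110, 10000])

# Log transformation compresses the scale so the outlier is less extreme
logged = np.log1p(s)

# The ratio of max to median shrinks dramatically after the transform
print(s.max() / s.median(), logged.max() / logged.median())
```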

Page 6: Predictive modeling

Model Building

Many of the problems solved using machine learning techniques are classification problems.

For numerical target variables - Linear Regression

For categorical target variables - Logistic Regression

Tree based methods - Random Forest, Gradient Boosting

Performance Estimation and Improvements

Performance of the model can be checked using various methods

ROC-AUC curve (Receiver Operating Characteristic - Area Under the Curve)

Checking accuracy score

Confusion Matrix

Measuring Root mean squared error
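These checks can be sketched with scikit-learn's metrics module, using hypothetical labels and predicted probabilities:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical true labels and model outputs, for illustration only
y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
y_prob = [0.2, 0.6, 0.8, 0.9, 0.1, 0.4]  # predicted probability of class 1

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: actual, columns: predicted
auc = roc_auc_score(y_true, y_prob)
print(acc, auc)
print(cm)
```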

To estimate and reduce model error, one can use k-fold cross-validation. If there are a large number of predictors, one can select the best features using the SelectKBest technique; this is a form of dimensionality reduction. Principal Component Analysis (PCA) is another powerful way to reduce the number of dimensions and deal with multicollinearity.
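A minimal PCA sketch on synthetic correlated features (the data is illustrative, not the bank dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy feature matrix: the first two columns are strongly correlated
rng = np.random.RandomState(0)
x1 = rng.normal(size=100)
X = np.column_stack([x1,
                     x1 * 2 + rng.normal(scale=0.1, size=100),
                     rng.normal(size=100)])

# Reduce three correlated features to two principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```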

Page 7: Predictive modeling

iPython Notebook https://github.com/Prashantmdgl9/Predictive-Analytics

Import the libraries and read the data. One should take care of the delimiters; though the data is in CSV format, it is semicolon delimited. Check the number of rows and decide how much you want to keep in training and test. For now, don’t split the data, as any cleansing that needs to be done should be applied to the complete data frame.

Look at the data in detail: use the head and describe functions and the columns attribute in pandas to take a closer look.
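A sketch of this loading step; a small inline semicolon-delimited sample stands in for the actual bank file so the snippet is self-contained:

```python
import io
import pandas as pd

# Stand-in for the bank file: CSV data, but semicolon-delimited
raw = '"age";"job";"y"\n30;"admin.";"no"\n45;"technician";"yes"\n'

# sep=';' is essential -- the UCI bank file is semicolon-delimited
df = pd.read_csv(io.StringIO(raw), sep=";")

print(df.shape)              # number of rows and columns
print(df.head())             # first rows
print(df.describe())         # summary statistics of numeric columns
print(df.columns.tolist())   # column names
```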

Page 8: Predictive modeling

The data doesn’t have an ID column; let’s add one and check the various data types.
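For instance (with a toy frame standing in for the bank data):

```python
import pandas as pd

# Toy frame standing in for the bank data, which ships without an ID column
df = pd.DataFrame({"age": [30, 45, 52],
                   "job": ["admin.", "technician", "services"]})

# Add a simple sequential ID and inspect the column types
df["ID"] = range(1, len(df) + 1)
print(df.dtypes)
```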

Page 9: Predictive modeling

The data frame has int and string type variables. Let’s check whether there are any columns with missing data and identify the ID and target variables. Also, separate the numeric and the categorical columns.
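A sketch of these checks on a hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [30, np.nan, 52],
                   "job": ["admin.", None, "services"],
                   "y": ["no", "yes", "no"]})

# Columns with missing values
missing = df.isnull().sum()
print(missing[missing > 0])

# Separate numeric and categorical (object) columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()
print(numeric_cols, categorical_cols)
```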

Page 10: Predictive modeling

For purposes of demonstration, we will deal with the missing data: impute numeric values with the mean and categorical values with -9999.
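A minimal imputation sketch on made-up values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [30.0, np.nan, 50.0],
                   "job": ["admin.", None, "services"]})

# Impute numeric columns with the mean, categorical with the placeholder -9999
df["age"] = df["age"].fillna(df["age"].mean())
df["job"] = df["job"].fillna("-9999")
print(df)
```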

Let’s scale the features. Feature scaling limits the range of the variables so they can be compared; it is done for continuous variables. Before doing so, let’s plot histograms of the numeric variables to check the range and distribution.
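A histogram sketch with pandas and matplotlib; synthetic columns stand in for the bank data, and in a notebook one would call plt.show() instead of saving to a file:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.RandomState(0)
df = pd.DataFrame({"age": rng.randint(18, 90, 200),
                   "duration": rng.exponential(200, 200)})

# One histogram per numeric column to inspect range and distribution
axes = df.hist(bins=20)
plt.tight_layout()
plt.savefig("histograms.png")
```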

Page 11: Predictive modeling

We see that many variables have entirely different ranges, so scaling might help. Before proceeding we should encode our categorical variables: they are string objects, and we can’t build models on string variables unless they are converted.
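Label encoding can be sketched with scikit-learn's LabelEncoder (the category values here are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Toy categorical column standing in for e.g. the 'job' variable
jobs = ["admin.", "technician", "admin.", "services"]

# LabelEncoder maps each distinct string to an integer code
le = LabelEncoder()
encoded = le.fit_transform(jobs)
print(list(encoded), list(le.classes_))
```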

Page 12: Predictive modeling

Let’s scale the data frame using the MinMax scaler and fit kNN. We also compute the accuracy of the model.
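A self-contained sketch of this step, using a synthetic dataset from make_classification in place of the bank data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the encoded bank features and target
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit the scaler on the training data only, then transform both splits
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_s, y_train)
acc = knn.score(X_test_s, y_test)  # accuracy on the held-out test set
print(acc)
```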

We can normalize the data using the scale function in the scikit-learn library and perform logistic regression on the normalized data.
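Similarly, a sketch with scikit-learn's scale function and logistic regression on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import scale

X, y = make_classification(n_samples=200, n_features=6, random_state=1)

# scale() standardizes each feature to zero mean and unit variance
X_norm = scale(X)

logreg = LogisticRegression()
logreg.fit(X_norm, y)
acc = logreg.score(X_norm, y)  # training accuracy, for illustration only
print(acc)
```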

Page 13: Predictive modeling

We can proceed with machine learning algorithms; random forest is our algorithm of choice because of its strong performance.

To manage the bias-variance trade-off and get a more reliable estimate of model error, we should use k-fold cross-validation.
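A sketch of random forest with 10-fold cross-validation, again on a synthetic stand-in for the bank data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# 10-fold cross-validation gives a more stable estimate of accuracy
scores = cross_val_score(rf, X, y, cv=10)
print(scores.mean())
```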

Page 14: Predictive modeling

Feature selection is one of the ways to reduce the number of predictors, keeping only those that carry the most information for the model's predictions. It can be achieved using principal component analysis or the SelectKBest feature of scikit-learn.

Let’s plot the p-values of the features and form a shortlist of the best features.
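A sketch with SelectKBest and the f_classif score function (which returns an F-score and a p-value per feature), on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# f_classif computes an ANOVA F-score and p-value for each feature;
# SelectKBest keeps the k features with the highest scores (lowest p-values)
selector = SelectKBest(score_func=f_classif, k=3)
X_best = selector.fit_transform(X, y)

print(selector.pvalues_.round(4))  # one p-value per original feature
print(X_best.shape)
```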

Page 15: Predictive modeling

Let’s form our new list of the best-scoring features (those with the lowest p-values) and fit a random forest with cross-validation (k = 10) on the training set, then evaluate on the test data.
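The final step can be sketched as follows, once more on a synthetic stand-in for the bank data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=400, n_features=12,
                           n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Keep the most significant features (highest scores / lowest p-values)
selector = SelectKBest(score_func=f_classif, k=4).fit(X_train, y_train)
X_train_b = selector.transform(X_train)
X_test_b = selector.transform(X_test)

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# 10-fold cross-validation on the training set, then a final fit and test score
cv_mean = cross_val_score(rf, X_train_b, y_train, cv=10).mean()
rf.fit(X_train_b, y_train)
test_acc = rf.score(X_test_b, y_test)
print(cv_mean, test_acc)
```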

Conclusion

The step above concludes our model-building exercise. We started with data exploration and preparation and went on to use methods such as kNN, cross-validation and random forests to arrive at the final model and results. Depending on the dataset, one may have to add or remove a few steps, but the gist remains the same - explore, treat, build and improve. The steps above can be used to build a starting model on any type of dataset with decent accuracy.

* If you want to download the Python notebook of the project, you can visit

https://github.com/Prashantmdgl9/Predictive-Analytics