IPL Cluster

8/18/2019 IPL Cluster

The information provided on these pages remains, unless otherwise stated, the copyright of the respective authors. All layout, design, original graphics, concepts and other World Wide Web Intellectual Property Rights, barring the information mentioned above, remain the property and copyright of OrangeTree Global.

Case Study on Cluster Analysis

In this case study we perform a cluster analysis to form homogeneous groups of players based on their past performances. The dataset covers 435 players and 13 variables on which the clusters will be formed. The variable names are explained below.

VARIABLES   DESCRIPTION
Mat         Matches played
Inns        Innings batted
Not Out     Not outs
Runs        Runs scored
HS          Highest innings score
Ave         Batting average
BF          Balls faced
SR          Batting strike rate
100s        Hundreds scored
50s         Fifties scored
DUCK        Ducks (dismissals for zero)
4s          Fours hit
6s          Sixes hit

1. We first create a library named ‘ipl’ and then import the data using the following statements:

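/* Library that will hold every dataset of the case study */
libname ipl "E:\ARUP\cluster";

/* Read the CSV file; the first row holds the column names */
proc import datafile="E:\ARUP\cluster\cluster_data.csv"
    out=ipl.data dbms=csv replace;
    getnames=yes;
    datarow=2;
run;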

2. Then we rename a few variables to make them easier to use later. We also create a PDF document ‘variables’ which lists the variable names.

data ipl.data;
    set ipl.data;
    rename _00s=hundreds _0s=fifties _s=fours var14=sixes;
run;

ods pdf body="E:\ARUP\cluster\variables.pdf";
proc contents data=ipl.data position short;
run;
ods pdf close;
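The rename above rewrites the whole dataset through a data step. As an alternative sketch, PROC DATASETS can rename variables in place. The table work.toy and its two rows are made up for illustration; only the variable names _00s and _0s are borrowed from the step above.

```sas
/* Toy table with hypothetical values, standing in for ipl.data */
data work.toy;
    input Player $ _00s _0s;
    datalines;
A 2 10
B 0 3
;
run;

/* Rename in place: only the variable names change, the rows are not re-read */
proc datasets library=work nolist;
    modify toy;
    rename _00s=hundreds _0s=fifties;
run;
quit;

/* List the variable names to confirm the change */
proc contents data=work.toy position short;
run;
```

For a file of 435 rows either approach is fast; the in-place rename mainly matters for large tables.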

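3. Before clustering we have to standardize the dataset, because the variables are measured on very different scales and the clustering is based on distances between players. We therefore set the mean to zero and the standard deviation to one for each variable (excluding the variable ‘player’).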

proc standard data=ipl.data mean=0 std=1 out=ipl.standard;
    var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
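As a quick check on this step, the sketch below standardizes a small made-up table and confirms the result with PROC MEANS. Each value x is replaced by (x - mean) / standard deviation of its column. The dataset work.toy and its values are invented for illustration; on the real data the same check could be run on ipl.standard.

```sas
/* Toy data: hypothetical values, not taken from the IPL file */
data work.toy;
    input Player $ Runs SR;
    datalines;
A 120 110.5
B 450 135.2
C 30 95.0
D 800 142.8
E 260 121.3
;
run;

/* Same call as in step 3, applied to the toy table */
proc standard data=work.toy mean=0 std=1 out=work.toy_std;
    var Runs SR;
run;

/* Each standardized variable should show mean 0 and std 1, up to rounding */
proc means data=work.toy_std mean std maxdec=4;
    var Runs SR;
run;
```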



4. Now we use Ward’s method, together with the cubic clustering criterion (the ‘ccc’ option), and write the clustering history to the output dataset ‘cluster_tree’ through the ‘outtree=’ option. The ‘id’ statement names the variable ‘Player’, which identifies the observations that are grouped on the basis of the remaining variables.

ods pdf body="E:\ARUP\cluster\result_1.pdf";
proc cluster data=ipl.standard outtree=ipl.cluster_tree method=ward ccc;
    id Player;
    var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
ods pdf close;

Let us take a look at the eigenvalues.

The first three eigenvalues are greater than one. The cumulative proportion of variance accounted for by the first three eigenvalues is 0.8120, i.e. 81.2%. Including the fourth eigenvalue as well raises the variance explained to 87.08%.
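The percentages quoted here can be related back to the eigenvalues themselves. The following is a worked sketch, assuming the eigenvalues are those of the covariance matrix of the 13 standardized variables, so that they sum to 13. The symbol \lambda_i for the i-th eigenvalue is notation introduced here, and the numbers are rounded.

```latex
\[
  \text{cumulative proportion}(k)
  = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{j=1}^{13} \lambda_j}
  = \frac{\lambda_1 + \dots + \lambda_k}{13}
\]
\[
  \lambda_1 + \lambda_2 + \lambda_3 \approx 0.8120 \times 13 \approx 10.56,
  \qquad
  \lambda_4 \approx (0.8708 - 0.8120) \times 13 \approx 0.76
\]
```

Under that assumption the fourth eigenvalue is below one, which is consistent with only the first three exceeding one.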

5. We now use the dataset ‘cluster_tree’, created by the ‘outtree=’ option, to generate a dendrogram (tree diagram) and find out the number of clusters.

ods pdf body="E:\ARUP\cluster\result_2.pdf";
proc tree data=ipl.cluster_tree;
    id Player;
    copy Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
ods pdf close;

Following is the output.



The y-axis shows the semi-partial R-squared values and the x-axis shows the players’ names. A horizontal cut through the tree gives one cluster for every vertical line it crosses. If we draw a horizontal line at level 0.2 (or somewhat higher), it intersects two vertical lines, so we get two clusters. At level 0.4 we have only one cluster, i.e. all the players in one group, which is certainly not desirable. At level 0.08 the horizontal line intersects four vertical lines. If we take this level as the desired one, we get four clusters.
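For reference, the height plotted on the y-axis can be written out. The notation below (clusters K and L with sizes N_K and N_L, centroids \bar{x}_K and \bar{x}_L, and total sum of squares T over all n players) is introduced here for illustration and follows the usual statement of Ward's method; it is not taken from the original case study.

```latex
\[
  B_{KL} = \frac{\lVert \bar{x}_K - \bar{x}_L \rVert^{2}}{\dfrac{1}{N_K} + \dfrac{1}{N_L}},
  \qquad
  \text{semi-partial } R^{2} = \frac{B_{KL}}{T},
  \qquad
  T = \sum_{i=1}^{n} \lVert x_i - \bar{x} \rVert^{2}
\]
```

At each step Ward's method merges the pair of clusters with the smallest B_{KL}, that is, the smallest increase in the within-cluster sum of squares. A horizontal cut at level h leaves out every merge whose semi-partial R-squared exceeds h, which is why lower cut levels give more clusters.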

6. Having decided on the number of clusters, we move on to the next step and run the following set of statements.

ods pdf body="E:\ARUP\cluster\result_3.pdf";
proc tree data=ipl.cluster_tree nclusters=4 out=ipl.cluster;
    id Player;
    copy Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
ods pdf close;

In this code the ‘nclusters=4’ option cuts the tree into exactly four clusters. The ‘out=’ option names the dataset ‘cluster’, which contains all the standardized variables along with the players’ names and the corresponding cluster numbers. The output also shows the same tree diagram.
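The sequence of steps 3 to 6 can be tried out in miniature. The sketch below runs the same procedures on six made-up players that fall into two obvious groups, then counts the members of each cluster. All dataset names in the WORK library and all values are hypothetical.

```sas
/* Toy data: two clearly separated groups of hypothetical players */
data work.toy;
    input Player $ Runs SR;
    datalines;
A 100 105
B 120 110
C 90 100
D 800 140
E 850 145
F 780 138
;
run;

/* Standardize as in step 3 */
proc standard data=work.toy mean=0 std=1 out=work.toy_std;
    var Runs SR;
run;

/* Ward's method as in step 4; NOPRINT suppresses the printed history */
proc cluster data=work.toy_std method=ward outtree=work.toy_tree noprint;
    id Player;
    var Runs SR;
run;

/* Cut the tree into two clusters as in step 6; NOPRINT skips the dendrogram */
proc tree data=work.toy_tree nclusters=2 out=work.toy_clusters noprint;
    id Player;
    copy Runs SR;
run;

/* Count the members of each cluster */
proc freq data=work.toy_clusters;
    tables cluster;
run;
```

With values this well separated one would expect the two groups of three players to be recovered, but the frequency table is the place to confirm it.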

The clusters have now been created. Next we move on to some further analysis based on them.



7. We create a new dataset ‘final_clusters’ which contains only the players’ names and the corresponding cluster numbers.

data ipl.final_clusters;
    set ipl.cluster (keep=Player cluster);
run;

8. Now we merge the original dataset ‘data’ with the new dataset ‘final_clusters’. The newly created dataset is ‘clustered_data’. We first sort both datasets by the common column ‘Player’ and then keep only the players that appear in both. We use the following set of statements.

proc sort data=ipl.final_clusters;
    by Player;
run;

proc sort data=ipl.data;
    by Player;
run;

/* Keep only the players found in both datasets */
data ipl.clustered_data;
    merge ipl.data(in=x) ipl.final_clusters(in=y);
    by Player;
    if x=1 and y=1;
run;
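The sort-and-merge above keeps only players found in both datasets. As a sketch of the same idea without the explicit sorts, an inner join in PROC SQL gives the same rows, assuming each player name appears only once in each table. The tables work.stats and work.labels are hypothetical miniature stand-ins for ipl.data and ipl.final_clusters.

```sas
/* Hypothetical miniature version of ipl.data */
data work.stats;
    input Player $ Runs;
    datalines;
A 100
B 800
C 450
;
run;

/* Hypothetical miniature version of ipl.final_clusters */
data work.labels;
    input Player $ cluster;
    datalines;
A 1
B 2
D 1
;
run;

/* Inner join: only players present in both tables are kept (A and B here) */
proc sql;
    create table work.joined as
    select s.*, l.cluster
    from work.stats as s
         inner join work.labels as l
         on s.Player = l.Player
    order by s.Player;
quit;

proc print data=work.joined noobs;
run;
```

Player C has no cluster label and player D has no statistics, so both are dropped, which mirrors the condition ‘if x=1 and y=1’ in the data step.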

9. To look at the mean, standard deviation and some other summary statistics of each cluster, we run PROC MEANS with ‘cluster’ as the class variable.

ods pdf body="E:\ARUP\cluster\clusterwise_summary.pdf";
proc means data=ipl.clustered_data n mean std skewness kurtosis;
    class cluster;
    var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
ods pdf close;

From the following figure we see that the four clusters have 193, 171, 52 and 19 players respectively. The averages of all the variables are also provided below.

Cluster four, the smallest with 19 players, seems to be an excellent grouping in the sense that its players have very high averages for all the variables.
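If the cluster means are needed for further work rather than only in a printed report, PROC MEANS can write them to a dataset. The sketch below does this on a made-up table; work.toy, work.cluster_means and all values are hypothetical.

```sas
/* Toy data: hypothetical players already labelled with a cluster number */
data work.toy;
    input Player $ cluster Runs SR;
    datalines;
A 1 100 105
B 1 120 110
C 2 800 140
D 2 850 145
;
run;

/* NWAY keeps only the per-cluster rows;
   AUTONAME names the output variables Runs_Mean and SR_Mean */
proc means data=work.toy noprint nway;
    class cluster;
    var Runs SR;
    output out=work.cluster_means mean= / autoname;
run;

/* _FREQ_ in the output holds the number of players in each cluster */
proc print data=work.cluster_means noobs;
run;
```

On the real data the same pattern would apply to ipl.clustered_data with the 13 variables in the ‘var’ statement.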



10. We can also list the players belonging to cluster four using the following code:



ods pdf body="E:\ARUP\cluster\cluster_4.pdf";
proc print data=ipl.clustered_data (keep=cluster player);
    where cluster=4;
run;
ods pdf close;

Below is the output.
