8/18/2019 IPL Cluster
The information provided on these pages remains, unless otherwise stated, the copyright of the respective authors. All layout, design, original graphics, concepts and
other World Wide Web Intellectual Property Rights barring the information mentioned above, remains the property and copyright of OrangeTree Global
Case Study on Cluster Analysis
In this case study we perform a cluster analysis to form homogeneous groups of players based
on their past performances. The dataset has 435 players and 13 variables on which the clusters
will be formed. The variable names are explained below.
VARIABLES   DESCRIPTION
Mat         Matches played
Inns        Innings batted
Not Out     Not outs
Runs        Runs scored
HS          Highest innings score
Ave         Batting average
BF          Balls faced
SR          Batting strike rate
100s        Hundreds scored
50s         Fifties scored
DUCK        Ducks scored
1. We first create a library named ‘ipl’. Then we import the data using the following statements:
/* Point the library at the project folder. */
libname ipl "E:\ARUP\cluster";
/* Read the CSV file; the first row holds the variable names. */
proc import datafile="E:\ARUP\cluster\cluster_data.csv" out=ipl.data
    dbms=csv replace;
    getnames=yes;
    datarow=2;
run;
2. Then we rename a few variables to make them easier to use later. We also
create a PDF document ‘variables’ which lists the variable names.
data ipl.data;
    set ipl.data;
    rename _00s=hundreds _0s=fifties _s=fours var14=sixes;
run;
/* Write the variable names to a PDF file. */
ods pdf file="E:\ARUP\cluster\variables.pdf";
proc contents data=ipl.data position short;
run;
ods pdf close;
3. Before clustering we standardize the dataset, so that variables measured on larger scales
(such as Runs) do not dominate the distance calculations. We set the mean to zero and the
standard deviation to one for each variable (excluding the variable ‘player’).
proc standard data=ipl.data mean=0 std=1 out=ipl.standard;
    var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
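A quick check that the standardization worked can be sketched as follows. It assumes step 3 has already run, so that ipl.standard exists; the nmiss and maxdec=4 options are only illustrative choices and not part of the original study. Each variable should report a mean of about 0 and a standard deviation of about 1.

```sas
/* Sanity check (sketch): summarize the standardized variables.      */
/* nmiss is included because proc cluster leaves out observations    */
/* that have missing values in the analysis variables.               */
proc means data=ipl.standard n nmiss mean std maxdec=4;
    var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
```

If any variable shows a nonzero nmiss count, fewer than 435 players will enter the clustering step.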
4. Now we use Ward’s method along with the cubic clustering criterion (the ‘ccc’
option). The ‘outtree’ option creates the output dataset ‘cluster_tree’.
The ‘id’ statement names the variable ‘player’, which identifies the
observations to be grouped based on the remaining variables.
ods pdf file="E:\ARUP\cluster\result_1.pdf";
proc cluster data=ipl.standard outtree=ipl.cluster_tree method=ward ccc;
    id Player;
    var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
ods pdf close;
Let us take a look at the eigenvalues.
The first three eigenvalues are greater than one. The cumulative variation
explained by the first three components is 0.8120, i.e. 81.2%. We may also include
the fourth component, which raises the variance explained to 87.08%.
5. We now use the dataset ‘cluster_tree’, created by the outtree option, to
generate a dendrogram (tree diagram) and find out the number of clusters.
ods pdf file="E:\ARUP\cluster\result_2.pdf";
proc tree data=ipl.cluster_tree;
    id Player;
    copy Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
ods pdf close;
Following is the output.
On the y-axis we have the semi-partial R-squared values and along the x-axis we have the players’ names. If we
draw a horizontal line at level 0.2 then it intersects two vertical lines, so we get two clusters.
At level 0.4 we have only one cluster, i.e. all the players in one group, which is certainly not desirable. At
level 0.08 the horizontal line intersects four vertical lines. If we take this level as the desired one then
we get four clusters.
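The choice of cut level can also be cross-checked numerically rather than only by eye. The sketch below assumes that ipl.cluster_tree carries the standard variables that proc cluster writes to an outtree dataset for Ward’s method (_NCL_, _NAME_, _FREQ_, _SPRSQ_ and _RSQ_). The dataset name work.top_merges and the cutoff of 10 are hypothetical choices made only for illustration.

```sas
/* Sketch: list the last few merges of the cluster history.          */
/* _NCL_ is the number of clusters after a merge; _SPRSQ_ is the     */
/* semi-partial R-squared, i.e. the height plotted on the y-axis.    */
proc sort data=ipl.cluster_tree(where=(_NCL_ <= 10)) out=work.top_merges;
    by _NCL_;
run;

proc print data=work.top_merges noobs;
    var _NCL_ _NAME_ _FREQ_ _SPRSQ_ _RSQ_;
run;
```

A sharp rise in _SPRSQ_ when moving from one row to the row with one cluster fewer suggests that the later merge joins dissimilar groups, which supports cutting the tree just before it.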
6. After deciding upon the number of clusters we move on to the next step. We run the following
set of statements.
ods pdf file="E:\ARUP\cluster\result_3.pdf";
proc tree data=ipl.cluster_tree nclusters=4 out=ipl.cluster;
    id Player;
    copy Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
ods pdf close;
In this code the ‘nclusters=4’ option cuts the tree into four clusters. The ‘out’ option creates the dataset
‘cluster’, which contains all the standardized variables along with the players’ names and the corresponding
cluster numbers. The output also shows the same tree diagram.
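To see at once how many players fall into each cluster, a frequency table on the ‘cluster’ variable can be used. This is a minimal sketch that assumes step 6 has run and ipl.cluster exists; the nocum option is just a display choice.

```sas
/* Sketch: count the players assigned to each of the four clusters. */
proc freq data=ipl.cluster;
    tables cluster / nocum;
run;
```

The counts should add up to the number of players that entered the clustering.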
The clusters have now been created. Next we carry out some further analysis based on
them.
7. We create a new dataset ‘final_clusters’ which contains only the players’ names and the
corresponding cluster numbers.
data ipl.final_clusters;
    set ipl.cluster (keep=Player cluster);
run;
8. Now we merge the original dataset named ‘data’ with the new dataset ‘final_clusters’. The
newly created dataset is ‘clustered_data’. First we sort both datasets by the common
column ‘Player’. We use the following set of statements.
proc sort data=ipl.final_clusters;
    by Player;
run;
proc sort data=ipl.data;
    by Player;
run;
/* Keep only the players present in both datasets. */
data ipl.clustered_data;
    merge ipl.data(in=x) ipl.final_clusters(in=y);
    by Player;
    if x=1 and y=1;
run;
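Because the merge keeps only players found in both datasets, any player missing from one side is dropped silently. The sketch below lists such players, if there are any. It assumes both datasets are already sorted by Player as in step 8; the dataset work.unmatched and the variable source are hypothetical names introduced only for this check.

```sas
/* Sketch: find players that appear in only one of the two datasets. */
data work.unmatched;
    length source $12;
    merge ipl.data(in=x keep=Player)
          ipl.final_clusters(in=y keep=Player);
    by Player;
    if not (x and y);                        /* keep non-matches only */
    source = ifc(x, "data only", "cluster only");
run;

proc print data=work.unmatched noobs;
run;
```

An empty work.unmatched means every player in ‘data’ received a cluster number.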
9. To look at the means, standard deviations and some other summary statistics of each
cluster, we run a proc means step with ‘cluster’ as the class variable.
ods pdf file="E:\ARUP\cluster\clusterwise_summary.pdf";
proc means data=ipl.clustered_data n mean std skewness kurtosis;
    var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
    class cluster;
run;
ods pdf close;
From the following figure we see that the four clusters have 193, 171, 52 and
19 players respectively. The averages of all the variables are also provided
below.
Cluster four appears to group the strongest performers, in the sense that
its players have very high averages for all the variables.
10. We can also see the players belonging to the cluster four using the code:
ods pdf file="E:\ARUP\cluster\cluster_4.pdf";
proc print data=ipl.clustered_data (keep=cluster player);
    where cluster=4;
run;
ods pdf close;
Below is the output.