IPL Cluster

  • 8/18/2019 IPL Cluster

    The information provided on these pages remains, unless otherwise stated, the copyright of the respective authors. All layout, design, original graphics, concepts and other World Wide Web Intellectual Property Rights, barring the information mentioned above, remain the property and copyright of OrangeTree Global.

    Case Study on Cluster Analysis

    In this case study we perform a cluster analysis to form homogeneous groups of players based on their past performances. The dataset has 435 players and 13 variables on which the clusters will be formed. The variable names are explained below.

    VARIABLE    DESCRIPTION
    Mat         Matches played
    Inns        Innings batted
    Not Out     Not outs
    Runs        Runs scored
    HS          Highest innings score
    Ave         Batting average
    BF          Balls faced
    SR          Batting strike rate
    100s        Hundreds scored
    50s         Fifties scored
    DUCK        Ducks (dismissals without scoring)

    1.  We first create a library named ‘ipl’. Then we import the data using the following statements:

    /* Assign the library, then read the CSV file (first row holds the column names) */
    libname ipl "E:\ARUP\cluster";

    proc import datafile="E:\ARUP\cluster\cluster_data.csv"
        out=ipl.data dbms=csv replace;
        getnames=yes;
        datarow=2;
    run;

    2.  Then we rename a few variables to make them easier to use later. The import step produced the names _00s, _0s, _s and var14, which are not descriptive. We also create a PDF document ‘variables’ which lists the variable names.

    data ipl.data;
        set ipl.data;
        rename _00s=hundreds _0s=fifties _s=fours var14=sixes;
    run;

    /* Write the list of variable names to a PDF file */
    ods pdf body="E:\ARUP\cluster\variables.pdf";
    proc contents data=ipl.data position short;
    run;
    ods pdf close;

    3.  Before running the cluster analysis we standardize the dataset, so that variables measured on large scales (such as Runs) do not dominate the distances. Each variable, excluding the variable ‘player’, is rescaled to have mean zero and standard deviation one.

    proc standard data=ipl.data mean=0 std=1 out=ipl.standard;
        var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
    run;
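
A quick way to confirm that the standardization behaved as intended is to summarize the output dataset. The following is a minimal sketch and is not part of the original case study; it assumes the dataset ipl.standard created above, and each variable should show a mean close to 0 and a standard deviation close to 1.

```sas
/* Sanity check (an addition, not from the original handout):
   after PROC STANDARD each analysis variable should have
   mean ~0 and standard deviation ~1. */
proc means data=ipl.standard n mean std maxdec=4;
    var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
```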

    4.  Now we use Ward’s method, and request the cubic clustering criterion with the ‘ccc’ option, to create the output dataset ‘cluster_tree’. The ‘outtree’ option names this dataset. The ‘id’ statement identifies the variable ‘player’, whose values will be grouped on the basis of the remaining variables.

    ods pdf body="E:\ARUP\cluster\result_1.pdf";
    proc cluster data=ipl.standard outtree=ipl.cluster_tree method=ward ccc;
        id Player;
        var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
    run;
    ods pdf close;

    Let us take a look at the eigenvalues.

    The first three eigenvalues are greater than one. The cumulative proportion of variation explained by the first three is 0.8120, i.e. 81.2%. Including the fourth raises the variance explained to 87.08%.

    5.  We now use the dataset ‘cluster_tree’, created by the ‘outtree’ option, to generate a dendrogram (tree diagram) and find out the number of clusters.

    ods pdf body="E:\ARUP\cluster\result_2.pdf";
    proc tree data=ipl.cluster_tree;
        id Player;
        copy Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
    run;
    ods pdf close;

    Following is the output.

    The y-axis shows the semi-partial R-squared values and the x-axis shows the players’ names. A horizontal line drawn at level 0.2 (or slightly above) intersects two vertical lines, so we get two clusters. At level 0.4 we have only one cluster, i.e. all the players in one group, which is clearly not desirable. At level 0.08 the horizontal line intersects four vertical lines. If we take this level as the desired one, we get four clusters.
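
With 435 players the dendrogram labels are hard to read, so it can help to plot the merge statistics against the number of clusters instead. The sketch below is an addition, not part of the original case study. It assumes that the OUTTREE= dataset ipl.cluster_tree carries the variables _NCL_ (number of clusters), _SPRSQ_ (semi-partial R-squared) and _CCC_ (present only because the ‘ccc’ option was requested); the dataset name work.tree_stats is hypothetical.

```sas
/* Keep only the nodes for the last few merges (1 to 15 clusters).
   The lower bound also excludes rows where _NCL_ is missing. */
proc sort data=ipl.cluster_tree out=work.tree_stats;
    where 1 <= _ncl_ <= 15;
    by _ncl_;
run;

/* Plot semi-partial R-squared (left axis) and the cubic
   clustering criterion (right axis) by number of clusters. */
proc sgplot data=work.tree_stats;
    series x=_ncl_ y=_sprsq_ / markers;
    series x=_ncl_ y=_ccc_ / markers y2axis;
    xaxis label="Number of clusters" integer;
    yaxis label="Semi-partial R-squared";
    y2axis label="Cubic clustering criterion";
run;
```

A sharp drop in semi-partial R-squared after a given number of clusters, or a local peak in the cubic clustering criterion, would support the choice read off the dendrogram.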

    6.  Having decided on the number of clusters, we move on to the next step and run the following set of statements.

    ods pdf body="E:\ARUP\cluster\result_3.pdf";
    proc tree data=ipl.cluster_tree nclusters=4 out=ipl.cluster;
        id Player;
        copy Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
    run;
    ods pdf close;

    In this code the ‘nclusters=4’ option cuts the tree at four clusters. The ‘out’ option names the dataset ‘cluster’, which contains all the standardized variables along with the players’ names and the corresponding cluster numbers. The output also shows the same tree diagram.
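
To confirm that the cut produced four groups and to see how many players fall in each, one option is a frequency table of the cluster number. This is a minimal sketch added for illustration; it assumes the dataset ipl.cluster and its variable ‘cluster’ created by the step above.

```sas
/* Count the players assigned to each cluster
   (an addition, not from the original handout). */
proc freq data=ipl.cluster;
    tables cluster / nocum;
run;
```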

    The clusters are now created. We move on to some further analysis based on them.

    7.  We create a new dataset ‘final_clusters’ which contains only the players’ names and the corresponding cluster numbers.

    data ipl.final_clusters;
        set ipl.cluster (keep=Player cluster);
    run;

    8.  Now we merge the original dataset ‘data’ with the new dataset ‘final_clusters’. The resulting dataset is ‘clustered_data’. Both datasets must first be sorted by the common column ‘Player’. The condition ‘if x=1 and y=1’ keeps only the players present in both datasets. We use the following set of statements.

    proc sort data=ipl.final_clusters;
        by Player;
    run;

    proc sort data=ipl.data;
        by Player;
    run;

    data ipl.clustered_data;
        merge ipl.data(in=x) ipl.final_clusters(in=y);
        by Player;
        if x=1 and y=1;
    run;
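
Because the merge keeps only matched rows, any player whose name does not match exactly in both datasets is dropped silently. The sketch below, an addition to the original case study, lists such rows so they can be inspected. It assumes both datasets are already sorted by Player as above; the dataset work.unmatched and the variable ‘source’ are hypothetical names.

```sas
/* List players that appear in only one of the two datasets.
   Requires both inputs to be sorted by Player. */
data work.unmatched;
    merge ipl.data(in=x keep=Player)
          ipl.final_clusters(in=y keep=Player);
    by Player;
    length source $12;
    if x and not y then source = "data only";
    else if y and not x then source = "cluster only";
    else delete;   /* matched rows are not of interest here */
run;

proc print data=work.unmatched;
run;
```

If the printed listing is empty, all 435 players were matched and ‘clustered_data’ is complete.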

    9.  To look at the means, standard deviations and some other summary statistics of each cluster, we run a proc means step with ‘cluster’ as the class variable.

    ods pdf body="E:\ARUP\cluster\clusterwise_summary.pdf";
    proc means data=ipl.clustered_data n mean std skewness kurtosis;
        class cluster;
        var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
    run;
    ods pdf close;

    From the following figure we see that the four clusters have 193, 171, 52 and 19 players respectively, which together account for all 435 players. The averages of all the variables are also provided below.

    Cluster four appears to group the strongest performers: its averages are very high on all the variables.
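
The cluster means are easier to compare side by side when they are stored in a dataset with one row per cluster. The following is a sketch under stated assumptions rather than part of the original analysis: it uses the dataset ipl.clustered_data from step 8, a subset of the variables chosen only for illustration, and a hypothetical output dataset work.cluster_means.

```sas
/* One row per cluster holding the mean of each selected variable.
   NWAY keeps only the per-cluster rows; _FREQ_ is the cluster size. */
proc means data=ipl.clustered_data nway noprint;
    class cluster;
    var Runs Ave SR hundreds fifties fours sixes;
    output out=work.cluster_means mean=;
run;

proc print data=work.cluster_means noobs;
    var cluster _freq_ Runs Ave SR hundreds fifties fours sixes;
run;
```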

    10. We can also list the players belonging to cluster four using the following code:

    ods pdf body="E:\ARUP\cluster\cluster_4.pdf";
    proc print data=ipl.clustered_data (keep=cluster player);
        where cluster=4;
    run;
    ods pdf close;

    Below is the output.