Predicting the winner of the Cy Young Award

Advisor: Dr. 黃三益
Team members: 尹川, 陳隆賢, 陳偉聖

Introduction

  • Baseball in Taiwan: CPBL (Chinese Professional Baseball League)
  • Baseball in the USA: MLB (Major League Baseball)
  • Cy Young Award
      • Awarded since 1956
      • Voted on by the Baseball Writers' Association of America
      • Decided by weighted scores
      • Each league has one winner per year (since 1967; from 1956 to 1966 there was a single winner for all of MLB).

Measurements

  • There are no definite rules for judging the award.
  • Nevertheless, many measurements can be used to judge whether a pitcher is good or not:
      • Wins
      • ERA
      • WHIP
      • G/F
      • etc.
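Two of the measurements listed above have simple closed-form definitions. A minimal sketch in Python, assuming the standard definitions (ERA as earned runs allowed per nine innings, WHIP as walks plus hits per inning pitched). The function and parameter names are illustrative, and innings are passed as a plain decimal number rather than in box-score ".1/.2" notation:

```python
def era(earned_runs: int, innings_pitched: float) -> float:
    """Earned run average: earned runs allowed per nine innings."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return 9.0 * earned_runs / innings_pitched


def whip(walks: int, hits: int, innings_pitched: float) -> float:
    """Walks plus hits per inning pitched."""
    if innings_pitched <= 0:
        raise ValueError("innings_pitched must be positive")
    return (walks + hits) / innings_pitched


# Made-up season line, for illustration only.
print(round(era(60, 200.0), 2))        # 2.7
print(round(whip(50, 150, 200.0), 2))  # 1.0
```

Lower is better for both values, which is the opposite direction from Wins.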

Aim of the study

  • To analyze the historical statistics of pitchers.
  • To build a predictive model.
  • To predict the Cy Young Award winner in future years.

Data mining procedure

Ten data mining methodology steps

Step 1 : Translate the Problem

  • Directed data mining problem
      • Target variable: Cy Young Award
      • Classification
      • Decision tree
  • Purposes
      • Gambling game
      • Predictive activities

Step 2 : Select Appropriate Data

  • MLB statistics only (1871 ~ 2006)
      • Cy Young Award era: 1956 ~ 2006, 21456 records in total
      • List of Cy Young Award winners
  • "Time" factor
      • 1999 is the dividing year, because some statistical items only appear from then on.
  • Variables: remove the items that are not representative of a pitcher.
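The year selection and the 1999 dividing year can be sketched as below. This is an illustration only: the record layout (a list of dicts with a `Year` key) and the helper name `select_years` are assumptions, not the authors' actual tooling.

```python
from typing import Dict, List

AWARD_START, DIVIDING_YEAR, LAST_YEAR = 1956, 1999, 2006


def select_years(records: List[Dict[str, int]], first: int, last: int) -> List[Dict[str, int]]:
    """Keep only the records whose Year lies in [first, last]."""
    return [r for r in records if first <= r["Year"] <= last]


# Tiny made-up sample; real rows would also carry the pitching statistics.
records = [{"Year": y} for y in (1871, 1955, 1956, 1998, 1999, 2006)]

award_era = select_years(records, AWARD_START, LAST_YEAR)
recent_era = select_years(records, DIVIDING_YEAR, LAST_YEAR)

print([r["Year"] for r in award_era])   # [1956, 1998, 1999, 2006]
print([r["Year"] for r in recent_era])  # [1999, 2006]
```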

Step 3 : Get to know the data

  • All the data we used come from the MLB official site.
  • These data have been publicly available for many years.
  • The quality of the data is very good.
  • Some attributes only have values from 1999 onward.
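One way to see which attributes only have values from a certain year is to record, for each attribute, the earliest year with a non-blank value. A minimal sketch, assuming records are dicts with a `Year` key and that a blank is stored as `None` or an empty string. The attribute name `STAT_X` is a placeholder, not one of the attributes that was actually blank in the MLB data:

```python
from typing import Dict, List, Optional


def first_year_with_value(records: List[Dict[str, object]]) -> Dict[str, Optional[int]]:
    """Map each attribute to the earliest Year in which it is non-blank."""
    first_seen: Dict[str, Optional[int]] = {}
    for rec in records:
        year = int(rec["Year"])
        for name, value in rec.items():
            if name == "Year":
                continue
            first_seen.setdefault(name, None)
            if value is None or value == "":
                continue  # blank cell: does not count as a value
            current = first_seen[name]
            if current is None or year < current:
                first_seen[name] = year
    return first_seen


# Made-up rows, for illustration only.
sample = [
    {"Year": 1998, "W": 18, "STAT_X": None},
    {"Year": 1999, "W": 23, "STAT_X": 1.1},
    {"Year": 2000, "W": 19, "STAT_X": 0.9},
]
print(first_year_with_value(sample))  # {'W': 1998, 'STAT_X': 1999}
```

An attribute that maps to `None` is blank in every record.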

Step 4 : Create a model set

  • We divide the data into training data and testing data.
  • We do not create a balanced sample.
  • The MLB records are not seasonal data.
  • We pick the data from 1999 onward.
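The slides do not say how the split was made. As one possibility, the sketch below makes a stratified hold-out split: because no balanced sample is created, WINNER rows are rare, so each class is split separately to make sure both sets contain some winners. The 2/3 training share, the fixed seed, and the function name are all assumptions.

```python
import random
from typing import Dict, List, Tuple

Row = Dict[str, object]


def stratified_split(rows: List[Row], label: str, train_share: float = 2 / 3,
                     seed: int = 0) -> Tuple[List[Row], List[Row]]:
    """Split rows into (train, test), keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class: Dict[object, List[Row]] = {}
    for row in rows:
        by_class.setdefault(row[label], []).append(row)

    train: List[Row] = []
    test: List[Row] = []
    for members in by_class.values():
        shuffled = members[:]          # leave the caller's list untouched
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_share)
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


# Made-up data with a similar kind of imbalance: 3 winners among 300 rows.
rows = [{"id": i, "CyYoungAward_Winner": "WINNER" if i < 3 else "NONWINNER"}
        for i in range(300)]
train, test = stratified_split(rows, "CyYoungAward_Winner")
print(len(train), len(test))  # 200 100
```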

Step 5 : Fix problems with the data

  • These data are taken from the MLB official site.
  • No missing values
  • Single source

Step 6 : Transform data to bring information to the surface

  • There are no combinations of attributes.
  • We delete some attributes.
  • We add an attribute: Year.
  • We add an attribute (CyYoungAward_Winner) for classification.
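The two added attributes can be sketched as follows: `Year` is attached when a season's rows are loaded, and `CyYoungAward_Winner` is looked up from the list of winners. The row layout, the `Player` key, the helper name, and the sample names are illustrative assumptions. The class values NONWINNER and WINNER are the ones that appear in the confusion matrices.

```python
from typing import Dict, List, Set, Tuple


def add_year_and_label(season_rows: List[Dict[str, object]], year: int,
                       winners: Set[Tuple[int, str]]) -> List[Dict[str, object]]:
    """Return copies of the rows with Year and CyYoungAward_Winner added."""
    labelled = []
    for row in season_rows:
        new_row = dict(row)  # copy, so the input rows are not modified
        new_row["Year"] = year
        is_winner = (year, str(row["Player"])) in winners
        new_row["CyYoungAward_Winner"] = "WINNER" if is_winner else "NONWINNER"
        labelled.append(new_row)
    return labelled


# Placeholder names, not real award data.
winners = {(2005, "Pitcher A")}
rows_2005 = [{"Player": "Pitcher A", "W": 21}, {"Player": "Pitcher B", "W": 9}]

for row in add_year_and_label(rows_2005, 2005, winners):
    print(row["Player"], row["Year"], row["CyYoungAward_Winner"])
# Pitcher A 2005 WINNER
# Pitcher B 2005 NONWINNER
```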

Step 7 : Build Models

  • Tools Used
  • Weka Crash Problem
  • Blank Attributes
  • Build Model
  • Handling Blank Attributes

Tools Used

Weka Crash Problem

  • Raw data
      • 21456 data instances
      • 42 attributes
  • Weka crashed during model construction.
      • Fix: give Weka more memory (a larger Java heap).

Blank Attributes

Build Model

  • MLB 1956~2006, with blank attributes: ADTree
  • MLB 1956~2006, without blank attributes: ADTree
  • MLB 1999~2006: ADTree

Handling Blank Attributes

1956~2006, with blank attributes, ADTree

=== Confusion Matrix ===

 NONWINNER    WINNER   <-- classified as
     21343        21 | NONWINNER
        58        34 | WINNER

1956~2006, without blank attributes, ADTree

=== Confusion Matrix ===

 NONWINNER    WINNER   <-- classified as
     21350        14 | NONWINNER
        62        30 | WINNER

1999~2006, ADTree

=== Confusion Matrix ===

 NONWINNER    WINNER   <-- classified as
      5090         3 | NONWINNER
        13         3 | WINNER

Step 8 : Assess Models (1/2)

1956~2006, without blank attributes, ADTree

=== Confusion Matrix ===

 NONWINNER    WINNER   <-- classified as
     21350        14 | NONWINNER
        62        30 | WINNER

1999~2006, ADTree

=== Confusion Matrix ===

 NONWINNER    WINNER   <-- classified as
      5090         3 | NONWINNER
        13         3 | WINNER

  • Not good enough for gambling
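Why the models are not good enough becomes clearer when the confusion matrices are reduced to a few summary numbers. The sketch below uses the counts from the three matrices in these slides, reading rows as the actual class and columns as the predicted class (Weka's layout). The function and variable names are illustrative.

```python
from typing import Dict, Tuple


def summarize(tn: int, fp: int, fn: int, tp: int) -> Dict[str, float]:
    """Accuracy, plus precision and recall for the WINNER class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        # Accuracy of always answering NONWINNER, for comparison.
        "baseline": (tn + fp) / total,
    }


# (tn, fp, fn, tp) copied from the confusion matrices.
matrices: Dict[str, Tuple[int, int, int, int]] = {
    "1956~2006, with blank attributes": (21343, 21, 58, 34),
    "1956~2006, without blank attributes": (21350, 14, 62, 30),
    "1999~2006": (5090, 3, 13, 3),
}

for name, counts in matrices.items():
    s = summarize(*counts)
    print(f"{name}: accuracy={s['accuracy']:.4f} baseline={s['baseline']:.4f} "
          f"precision={s['precision']:.3f} recall={s['recall']:.3f}")
```

Accuracy is above 99% for every model, but so is the baseline of always predicting NONWINNER, because winners are so rare (92 of the 21456 records for 1956~2006). Recall for WINNER is 34/92, 30/92, and 3/16 respectively, so most actual winners are missed, and for 1999~2006 the model's accuracy is exactly equal to the baseline.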

Step 8 : Assess Models (2/2)

  • Some attributes are more important.

Number of Appearances of Attributes in Different Models
Attributes (columns): W, BB, WPCT, OBA, WHIP, K/9, ERA, GF

  Model                                          Counts (non-empty cells only)
  1956~2006, ADTree                              2 3 1 1
  1956~2006, without blank attributes, ADTree    2 1 1 1 1 1
  1999~2006, ADTree                              2 1 1 1 1
  1956~2006, without blank attributes, J48       3 2 1 1

Step 9 : Deploy Models

  • Implement a computer program with the built model.
  • Predict the Cy Young Award winner more easily.
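A deployed program could be as small as a function that scores each pitcher with the built model and reports the top candidate per league. The sketch below only shows a possible shape for such a program: `score_pitcher` is a hypothetical stand-in with made-up thresholds, not the tree that Weka produced, and the `Player`, `League`, and `stats` keys are assumed.

```python
def score_pitcher(stats):
    """HYPOTHETICAL stand-in for the built model: higher means more WINNER-like.

    The thresholds and weights below are invented for illustration only.
    """
    score = 0.0
    score += 1.0 if stats["W"] >= 18 else -1.0
    score += 1.0 if stats["ERA"] <= 3.00 else -1.0
    score += 0.5 if stats["WHIP"] <= 1.10 else -0.5
    return score


def predict_winners(pitchers):
    """Return the name of the highest-scoring pitcher in each league."""
    best = {}
    for p in pitchers:
        league = p["League"]
        if league not in best or score_pitcher(p["stats"]) > score_pitcher(best[league]["stats"]):
            best[league] = p
    return {league: p["Player"] for league, p in best.items()}


# Placeholder pitchers with made-up statistics.
pitchers = [
    {"Player": "Pitcher A", "League": "AL", "stats": {"W": 20, "ERA": 2.80, "WHIP": 1.05}},
    {"Player": "Pitcher B", "League": "AL", "stats": {"W": 12, "ERA": 4.10, "WHIP": 1.30}},
    {"Player": "Pitcher C", "League": "NL", "stats": {"W": 17, "ERA": 2.95, "WHIP": 1.20}},
]
print(predict_winners(pitchers))  # {'AL': 'Pitcher A', 'NL': 'Pitcher C'}
```

Picking one name per league mirrors how the award is given, whereas the classifier on its own may label zero or several pitchers in a league as WINNER.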

Step 10 : Assess Results

  • Compare the predicted winner directly with the actual Cy Young Award winner.
  • The motivation is not "business" but "interest".
  • The assessment comes from personal judgment.

Conclusions

  • We used classification techniques to build the predictive model.
  • We found that the accuracy of the built model is not high.
      • There are some factors that we did not consider.
  • It cannot be used where real stakes are involved.
      • Just for fun