Hitter Performance Project

Does being on a more experienced team influence or increase the performance of rookie players in Major League Baseball?

Adam Rothstein Jesse Cox

May 3, 2016

Abstract

This paper examines the effect of a new MLB player’s team experience on the player’s performance in subsequent years. Using MLB player and team statistics from 2013 and 2014, we took a sample of rookie players entering the league no earlier than 2011 and used a multiple regression and a logistic regression to estimate, respectively, the size and the probability of an increase in skill and proficiency as measured by batting average. We find that our models explain little of the variation in rookie improvement. A larger, more comprehensive data set is needed so that we can drop the elements that skew results, such as negative changes in batting average.

1 Introduction/Literature Review

Drafting players is the highlight of the preseason for loyal followers of any major league sport, and it carries a lot of weight for the team’s likely performance that year. Being able to pick players who are not only top performers in their league now but also have the potential to become even better is pivotal. But what about the current performance of the team? We speculate that a player’s improvement depends not only on the player’s own potential but possibly also on the experience and reputation of the team around them.

We know that when a player joins the MLB their batting average is lower than it will be at their career high (Sommers). Sommers finds that a player’s batting average increases from the point they join the MLB until it reaches a career maximum, at which point it slowly declines, and that this process takes from 7 to 9 years. Sommers’s study took into account the possibility of injury and minimum at-bats, both of which tend to skew these results. Our study focuses on how a player’s team can influence their improvement or, better yet, the rate at which they improve.

Horowitz finds that MLB team owners want teams to be evenly matched because close games, or rather the prospect of a close game, drive ticket sales, which are a major source of profit for them (Horowitz). The study also finds that talent disparities have been much lower in recent years (Horowitz). Finally, it concludes that competition did not drive performance to its peak (Horowitz). We disagree with this on the basis that, even though talent disparities were found to be smaller than in previous years, we see the same teams making it to the playoffs year after year. We also disagree with the finding that competition does not drive improvement. We theorize that a more experienced team will have a positive spillover effect on new players, that the drive to perform up to par will be higher on such a team, and that this will cause rookies to improve at a faster rate.

2 Data

We used MLB statistics from baseball-reference.com. We took a sample of players from the 2013 season who were in their first year in the MLB. From the players making their first appearance that year we excluded everyone over the age of 25 in order to better capture younger players who still have more room for improvement. We used each player’s batting average during the 2013 season as their starting batting average. Once we had narrowed our list to these players, we found them on the 2014 rosters and pulled their batting averages for 2014.

We used all 30 MLB teams, all of which are represented within our sample of players. This opened up a lot of variability between teams: we expected different teams to have very different batting averages, runs, playoff wins and World Series wins. We used data from the 162 regular season games that every team played in 2014. In addition to these 2014 regular season games, we used data from the 2010, 2011, 2012, 2013, and 2014 playoff and World Series games to try to capture the reputation or longer term experience a team has to offer.

2.1 Dependent Variables

We used two dependent variables in our modeling and analysis. We began with just one, Percentage Change in BA, which was the change in batting average for our sample of rookie players from 2013 to 2014. It didn’t take long before we saw that not every player had an increase in batting average, as we had assumed from our predictions and from the literature on changes in batting average over time. To try to correct for this we created a dummy variable indicating whether the change in batting average was positive or negative and called it Positive Change, leaving us with two dependent variables for our modeling.
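As a minimal sketch of how the two dependent variables could be built for one player, the snippet below assumes that Percentage Change in BA is measured relative to the 2013 batting average (the text does not spell out the exact formula). The function name and the sample numbers are hypothetical and are not taken from our data.

```python
def batting_change(ba_2013, ba_2014):
    """Return (percentage change in BA, Positive Change dummy) for one player.

    Assumption: the percentage change is taken relative to the 2013 average.
    """
    if ba_2013 <= 0:
        raise ValueError("starting batting average must be positive")
    pct_change = (ba_2014 - ba_2013) / ba_2013 * 100.0
    positive_change = 1 if pct_change > 0 else 0
    return pct_change, positive_change


# Hypothetical rookies (made-up numbers, not the study's sample).
rookies = {"Player A": (0.250, 0.275), "Player B": (0.260, 0.234)}
for name, (start, end) in rookies.items():
    pct, dummy = batting_change(start, end)
    print(f"{name}: {pct:+.1f}% change, Positive Change = {dummy}")
```

A player whose average falls gets a negative percentage change and a dummy of 0, which is the case the second dependent variable was introduced to handle.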

2.2 Explanatory Variables

Average Batting Age on Team: The average batting age for each team is the average age of those who bat on the team. We think this will be significant because older players have, presumably, been in the league longer and therefore have more experience in their profession. This experience in turn should translate into a greater improvement in individual performance. The mean age for batters on MLB teams in 2014 was about 28 with a standard deviation of 1.3. The range is from 25 to about 30, which gives a rather good spread for possibly explaining some of the change in player performance (Figure 1).

Team Batting Average: This variable is the 2014 batting average of the player’s MLB team. We realize that the batting averages of individual players enter into this variable, but each team consists of enough players that we feel the small amount of correlation isn’t enough to throw off our results. The mean team batting average for MLB teams in 2014 was 28.1 with a relatively low standard deviation of only about 1.3. All teams lie within a range of only about 8 (Figure 1).

Team Average Runs per Game: This variable is the average number of runs per game that the MLB team scored over the course of the 2014 regular season. Our thinking is that a team that scores more runs per game will consist of better players and have a higher batting average overall. The summary statistics for this variable showed a mean of 4 with a standard deviation of only .3, so it does not seem likely to be very significant (Figure 1).

Team Total Season Runs 2014: This variable is the total number of runs each team scored over the 2014 season. We thought it would be significant for the same reason we thought average runs per game would be. While average runs per game had a relatively poor spread, total season runs had a mean of about 616 and a standard deviation of 52, which makes for a much better spread than runs per game (Figure 1).

Team Rank: The team rank variable takes a numerical value from 1 to 5 that corresponds to each team’s ranking within its division. There are five teams per division and six divisions in all. The average rank was three, which is above the expected value for an even spread and could be indicative of better teams recruiting more rookie players (in 2013 at least) (Figure 1).

Team Win Percentage: The team’s win percentage is calculated by dividing the team’s number of regular season wins by 162 (the number of regular season games each team plays per year). With a spread of 5% and a mean of 48%, the range is large enough that we expected significance (Figure 1).

Post Season Performance: The better teams in the league play in the postseason and have a chance to compete in the World Series. We thought this would be a very interesting variable to include, since teams that are better performers often make multiple and consecutive appearances, which suggests a certain reputation associated with some teams. In an attempt to capture the possible effect of this, we created a variable that measures performance in the postseason. If a team ended the regular season without making the playoffs, it received a value of 1. If a team made the playoffs but lost, it received a value of 2. If a team made a World Series appearance but lost, it received a value of 3. Finally, the team that won the World Series received a value of 4. We looked at the five most recent years (2010, 2011, 2012, 2013, and 2014) and summed each team’s yearly values into an aggregate total. A team that never made the playoffs, and thus had the lowest score each year, would have a value of 5. A team that won the World Series in all five years would have a value of 20. Of course, no team won in all five of these years, so the range for this variable is from 5 to 14, with a standard deviation of 2 (Figure 1).
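The scoring scheme described above can be sketched as follows. The point values (1 to 4) and the five-season window come from the text; the outcome labels, the function name and the sample histories are hypothetical and do not describe any actual team.

```python
# Points per season, following the scheme described in the text.
POINTS = {
    "none": 1,      # missed the playoffs
    "playoffs": 2,  # made the playoffs but lost before the World Series
    "ws_loss": 3,   # reached the World Series and lost
    "ws_win": 4,    # won the World Series
}


def postseason_score(results):
    """Sum the points for a team's season-by-season postseason outcomes."""
    return sum(POINTS[outcome] for outcome in results)


# Hypothetical five-season histories (2010-2014), not actual team records.
never_qualified = ["none"] * 5
always_champion = ["ws_win"] * 5
mixed = ["ws_win", "playoffs", "none", "ws_loss", "none"]

print(postseason_score(never_qualified))  # 5, the minimum
print(postseason_score(always_champion))  # 20, the theoretical maximum
print(postseason_score(mixed))            # 11
```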

3 Problems with Data

Once we compiled summary statistics for our data, we found that the change in a player’s batting average was not always positive (Figure 1). This makes sense given the fairly low range and standard deviation of batting averages in general. Not every player will improve dramatically and consistently in each of their first years. To try to correct for this we recorded whether each player had a positive change in batting average and used that indicator in a logistic regression. We felt that doing this would enable us to capture an association with our explanatory variables that would stand apart from a basic linear regression.

3.1 Experiment

We used two econometric models in our study to determine the effect of experience on

individual player development. The first was a simple multivariate regression of our explanatory

variables against the first dependent variable, Percentage Change in BA.

$$Y = \beta_0 + \beta_1\,\text{AvgBAge} + \beta_2\,\text{TmBA} + \beta_3\,\text{TmAvgR/G} + \beta_4\,\text{TmTR} + \beta_5\,\text{TmRank} + \beta_6\,\text{TmW\%} + \beta_7\,\text{PSeason} + \varepsilon$$

The second was a logistic regression in which we regressed our dummy variable, Positive Change, on our explanatory variables. With this we attempted to estimate the influence each variable may have on the probability of a positive change in batting average.

$$\operatorname{logit}(Y) = \beta_0 + \beta_1\,\text{AvgBAge} + \beta_2\,\text{TmBA} + \beta_3\,\text{TmAvgR/G} + \beta_4\,\text{TmTR} + \beta_5\,\text{TmRank} + \beta_6\,\text{TmW\%} + \beta_7\,\text{PSeason} + \varepsilon$$
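To illustrate how the first of these two specifications can be estimated, the sketch below fits an ordinary least squares regression through the normal equations and reports R². It is a plain-Python illustration under stated assumptions, not the software we used: the helper names are hypothetical, only two of the seven regressors are included to keep it short, and every number is made up.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear regressors)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def ols(rows, y):
    """Fit y = b0 + b1*x1 + ... by least squares; return (coefficients, R^2)."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


# Made-up observations: (AvgBAge, PSeason) per rookie and the percentage change in BA.
rows = [(27.1, 5), (28.4, 7), (29.0, 6), (30.2, 9), (26.3, 8), (28.0, 11)]
y = [4.0, -1.5, 2.0, -3.0, 6.0, 3.5]
beta, r_squared = ols(rows, y)
print("coefficients (intercept, AvgBAge, PSeason):", [round(b, 3) for b in beta])
print("R^2:", round(r_squared, 4))
```

The same function accepts all seven regressors by widening each tuple in `rows`; in practice a statistics package would also report standard errors and significance levels, which this sketch omits.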

4 Results

After running a multivariate regression using OLS estimators of the percentage change in batting average for individual players on our explanatory variables, we received an output with an R² of only .0984. We found all explanatory variables to be significant within the regression except for the team’s rank and win percentage. Team runs per game, team batting average, postseason performance and total team runs were significant only at the .10 level. The rest of the variables were significant at the .05 level or better (Figure 4). Average team batting age was the most significant variable and has a negative coefficient.

Our logistic model also had a very low pseudo R² value at .0862. Again we have explanatory variables that are not significant: the team’s rank, the team’s batting average, average team batting age, and the team’s win percentage. The remaining variables were significant at the .05 level or better (Figure 5).
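For readers unfamiliar with the pseudo R² reported for a logistic model, the sketch below shows one common definition, McFadden’s, which compares the fitted log-likelihood with that of an intercept-only model. We are assuming this is the measure behind the figure above, since the regression output does not name it. The function names are hypothetical, only one regressor is used, and the data are made up.

```python
import math


def fit_logit(x, y, iterations=25):
    """Fit logit(P(y=1)) = b0 + b1*x by Newton-Raphson; return (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Entries of X'WX with W = diag(p * (1 - p)), the negated Hessian.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: beta += (X'WX)^-1 * gradient, using the 2x2 inverse.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def log_likelihood(x, y, b0, b1):
    """Log-likelihood of the fitted one-regressor logistic model."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total


def mcfadden_r2(x, y):
    """McFadden's pseudo R^2: 1 - lnL(model) / lnL(intercept only)."""
    b0, b1 = fit_logit(x, y)
    ll_model = log_likelihood(x, y, b0, b1)
    share = sum(y) / len(y)  # the intercept-only model predicts this share for everyone
    ll_null = sum(y) * math.log(share) + (len(y) - sum(y)) * math.log(1.0 - share)
    return 1.0 - ll_model / ll_null


# Made-up data: a team's postseason score and the Positive Change dummy.
postseason = [5, 6, 7, 8, 9, 10, 11, 12]
positive_change = [0, 0, 1, 0, 1, 0, 1, 1]
print("pseudo R^2:", round(mcfadden_r2(postseason, positive_change), 4))
```

A value near zero, like the one we obtained, means the explanatory variables improve the likelihood only slightly over a model that predicts the same probability for every player.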

5 Main Findings/Marginal Effects

Within the first model we see that average batting age, team batting average and total runs all had a negative relationship with the percentage change in the player’s batting average. The result for average team batting age goes against our prediction that older players have a spillover effect on newer players because of their experience. The results for team batting average and total runs in the season also go against our prediction that a team that performs better in the season will produce higher performing rookie players. We do find a positive relationship with runs per game and with postseason performance. This agrees with our predictions, because a team that scores more runs in a game also hits more balls in a game, and teams that consistently make it to the postseason have an influence on their newer players.

In our logistic model we saw only three significant variables: total team runs, team runs per game and postseason performance. Total team runs in the season has a negative effect on the odds of a player having a positive change in batting average, while runs per game and better postseason performance have a positive effect on those odds. This does not go along with what we predicted. We would have liked to see the odds increase with older teams and higher team batting averages, which would better illustrate the possible effect of experience and competition.
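To make the language about effects on the odds concrete, the following short derivation shows how a logit coefficient maps to odds. It is a general property of the logistic model rather than a number read from Figure 5. Here P is the probability of a positive change and x_j is any one explanatory variable, with the others held fixed:

$$\frac{P}{1-P} = \exp\!\Big(\beta_0 + \sum_{j=1}^{7} \beta_j x_j\Big), \qquad \frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j}$$

A positive coefficient therefore multiplies the odds of improvement by a factor greater than one for each one-unit increase in that variable, and a negative coefficient multiplies them by a factor less than one.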

6 Conclusion

We believe that the variables that were not significant in the multivariate regression (team rank and team win percentage) were so because they were correlated with each other. The team with the highest win percentage also held the top rank within their league. If we had used the win percentage within the division and the rank within the entire league, we may have seen better results.
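One way to check the suspected correlation between team rank and win percentage is to compute their Pearson correlation and the implied variance inflation factor. The sketch below is only an illustration: the function names are hypothetical, the five teams are made up, and the VIF shortcut used here holds only when the two variables are compared with each other alone rather than with the full set of regressors.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def vif_two_regressors(r):
    """Variance inflation factor when one regressor is explained by a single other one."""
    return 1.0 / (1.0 - r * r)


# Hypothetical division: rank 1 is the best team, so rank falls as win percentage rises.
ranks = [1, 2, 3, 4, 5]
win_pct = [0.593, 0.543, 0.512, 0.469, 0.432]

r = pearson(ranks, win_pct)
print("correlation:", round(r, 3))
print("VIF:", round(vif_two_regressors(r), 1))
```

A correlation close to -1 and a large VIF would support dropping one of the two variables or redefining them as suggested above.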

Most of the variables that were not significant in the logistic regression again reflect the problem of win percentage and team rank being correlated. If we had had the access or the time to compile a bigger data set for this experiment, we feel we would have been able to drop the observations that were giving us the most trouble.

To refine this study a larger and cleaner data set is required. We only took players who entered in one year and followed them across one season. We could instead take all the rookie players who entered across several years, and possibly measure their change over two years instead of just one, to try to weed out the negative values that threw us off early on. The problem with putting together a larger data set is that it is time consuming: the data is not all in the same place because players are traded and switch teams from season to season.

This sampling technique could also be applied to studies of the effectiveness of rule changes or policy changes within the MLB. For example, drug policies were made stricter in 2004, so a sample of all new players from 2004 to date taken in this manner could be compared to a sample from the ten years prior to the new rules. Since younger players are less risk averse and have much more pressure on them to perform, they are more likely to try performance enhancing drugs, and it would be interesting to see whether the rule changes in the MLB had an effect on the rate or level of improvement in batting average.

7 References

Albert, J. (2010). Baseball Data at Season, Play-by-Play, and Pitch-by-Pitch Levels. Retrieved May 2, 2016, from http://www.amstat.org/publications/jse/v18n3/albert.pdf

Bradbury, J. C., & Drinen, D. J. (2008, April). Pigou at the Plate. Retrieved May 02, 2016, from

http://jse.sagepub.com/content/9/2/211.abstract

Furnald, N. A., & O'Hara, M. E. (2012, April 16). The Impact of Age on Baseball Players’ Performance. Retrieved May 2, 2016, from http://www.colgate.edu/portaldata/imagegallerywww/21c0d00240984995941f9ae8013632ee/ImageGallery/2012/theimpactofageonbaseballplayersperformance.pdf

Horowitz, I. (2000, May). The Impact of Competition on Performance Disparities in

Organizational Systems. Retrieved May 02, 2016, from

https://ideas.repec.org/a/sae/jospec/v1y2000i2p151176.html

Houser, A. (2005). Which Baseball Statistic Is the Most Important When Determining Team

Success? Retrieved May 2, 2016, from https://www.iwu.edu/economics/PPE13/houser.pdf

Sommers, P. M. (2008). The Changing Hitting Performance Profile in Major League Baseball, 1966-2006. Retrieved May 02, 2016, from http://www.researchgate.net/publication/247739825_The_Changing_Hitting_Performance_Profile_in_Major_League_Baseball_19662006

8 Appendix

Figure 1 Summary Statistics

Figure 2

Figure 3

Figure 4 Multivariate Regression Output

Figure 5 Logistic Regression Output
