MongoBaseball.Net David Hoerster 2014

Mongo Baseball .NET

Using the MongoDB Aggregation Pipeline, we'll look at how we can calculate several baseball statistics, including a few SABRmetric stats.

Page 1: Mongo Baseball .NET

MongoBaseball.Net

David Hoerster

2014

Page 2: Mongo Baseball .NET

About Me

C# MVP (since April 2011)

Sr. Director of Web Solutions at RGP

Conference Director for Pittsburgh TechFest

Co-Founder of BrainCredits (braincredits.com)

Past President of Pittsburgh .NET Users Group and organizer of recent Pittsburgh Code Camps and other Tech Events

Twitter - @DavidHoerster

Blog – http://geekswithblogs.net/DavidHoerster

Email – [email protected]

Page 3: Mongo Baseball .NET

Best of Both Worlds

+

Page 4: Mongo Baseball .NET

Assumptions

Basic understanding of document databases, like Mongo

Familiarity with querying in Mongo (outside the aggregation pipeline)

General understanding of baseball

Page 5: Mongo Baseball .NET

Baseball Statistics

Basics like AVG, OBP, and ERA have been around a long time

An underground of advanced statistics has been growing since the early 70s

Bill James is probably the best known

Society for American Baseball Research (SABR)

Fosters research into baseball's statistical history

Stats like wOBA, wRAA, WAR, DIPS, NERD and more

Lend themselves to computer modeling and big data

Page 6: Mongo Baseball .NET

Basic Mongo

Document database

A “NoSQL” solution

Wide range of querying and manipulation capabilities

Page 7: Mongo Baseball .NET

Querying Mongo Data

Issue a query as a JSON document

find and findOne are like LINQ's Select and First/Single methods

Basic cursor functionality (think DataReader)

Page 8: Mongo Baseball .NET

Mongo C# Driver

Download as a NuGet package

Actively worked on and contributed to

There is an “official” client, along with several community clients

Page 9: Mongo Baseball .NET

Mongo’s Aggregation Pipeline

MongoDB’s data aggregation solution

Modeled on the concept of data processing pipelines

Operations are performed in stages

Results from one stage are “piped” to the next stage

$match

$project

$sort
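As a rough in-memory analogy (plain JavaScript, not MongoDB itself), each stage can be thought of as a function that takes the previous stage's output as its input. The sample documents and the `runPipeline` helper are invented for illustration; only the field names follow the batting examples in these slides.

```javascript
// Hypothetical sample documents (not real player data).
const batting = [
  { PlayerId: "a", Year: 2013, Hits: 150, AtBats: 500 },
  { PlayerId: "b", Year: 2012, Hits: 180, AtBats: 600 },
  { PlayerId: "c", Year: 2013, Hits: 200, AtBats: 600 },
];

// Each stage maps an array of documents to a new array of documents.
const match = docs => docs.filter(d => d.Year === 2013);
const project = docs => docs.map(d => ({ PlayerId: d.PlayerId, AVG: d.Hits / d.AtBats }));
const sort = docs => [...docs].sort((x, y) => y.AVG - x.AVG);

// Pipe the output of one stage into the next.
const runPipeline = (docs, stages) => stages.reduce((acc, stage) => stage(acc), docs);

const result = runPipeline(batting, [match, project, sort]);
console.log(result.map(d => d.PlayerId)); // prints [ 'c', 'a' ]
```

Reordering the stages changes both the result and the amount of data each stage has to touch, which is why filtering early with `$match` is usually preferable.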

Page 10: Mongo Baseball .NET

Aggregation Pipeline

A number of operations are available

$group, $match, $project, $sort, $limit, $skip, $redact, $out, …

Essentially replaces the older mapReduce functionality

The Aggregation Pipeline generally provides better performance; mapReduce is more flexible

Aggregation combines a number of operations to produce a result set

Page 11: Mongo Baseball .NET

Aggregation Pipeline

Maximum size of a returned document is 16 MB

The Aggregation Pipeline now returns results using a cursor (as of 2.6)

Each stage of a pipeline is limited to 100 MB of RAM

Enable allowDiskUse to write to disk and avoid this limitation

MongoDB will also optimize the pipeline, if possible
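A minimal sketch of enabling `allowDiskUse` from the mongo shell, where `aggregate` accepts an options document as its second argument. The `aggregateWithDisk` helper is hypothetical and exists only so the call can be exercised without a server; the `batting` collection name follows the later slides.

```javascript
// A single memory-hungry stage: sort every batting document by hits.
const pipeline = [
  { $sort: { Hits: -1 } },
];

// allowDiskUse lets stages spill to temporary files instead of
// failing when they exceed the per-stage memory limit.
const options = { allowDiskUse: true };

// Hypothetical helper: `collection` is anything exposing
// aggregate(pipeline, options), such as db.batting in the mongo shell.
function aggregateWithDisk(collection) {
  return collection.aggregate(pipeline, options);
}
```

In the shell this would be called as `aggregateWithDisk(db.batting)`.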

Page 12: Mongo Baseball .NET

Simple Single Purpose Op Count

Page 13: Mongo Baseball .NET

Simple Projection

Batting Average (Hits / At Bats)

Page 14: Mongo Baseball .NET

Simple Projection Batting Average
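A minimal sketch of a batting-average projection in the mongo shell, assuming the `batting` collection and the `Hits` / `AtBats` field names used elsewhere in this deck. The `$cond` guard against zero at-bats mirrors the one in the C# version.

```javascript
// $project keeps the identifying fields and computes AVG = Hits / AtBats,
// returning 0 when a player has no at-bats to avoid dividing by zero.
const battingAveragePipeline = [
  { $project: {
      PlayerId: 1,
      Year: 1,
      TeamId: 1,
      AVG: { $cond: {
        if: { $eq: ["$AtBats", 0] },
        then: 0,
        else: { $divide: ["$Hits", "$AtBats"] }
      } }
  } }
];

// In the mongo shell: db.batting.aggregate(battingAveragePipeline)
```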

Page 15: Mongo Baseball .NET

Simple Projection Batting Average in C#

Page 16: Mongo Baseball .NET

LINQ to Mongo

Part of Mongo C# Driver

Implements find and findOne

Other grouping and projecting done client-side

Do you want all that data before manipulating it?

Page 17: Mongo Baseball .NET

Top 25 Batting Average in 2013

Add a $match pipeline operation

Page 18: Mongo Baseball .NET

Top 25 Batting Average in 2013

Now need to sort

Page 19: Mongo Baseball .NET

Top 25 Batting Average in 2013

But wait…we have incorrect results for top Batting Average

Need to enhance $match to include only qualifying players: 3.1 plate appearances per team game over 162 games (502 PA)
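The cutoff works out to 3.1 x 162 = 502.2, rounded down to 502. A sketch of the full top-25 pipeline with that threshold follows; like the later slides, it applies the cutoff to `AtBats`, and the variable names are illustrative only.

```javascript
// Qualification threshold: 3.1 plate appearances per team game
// over a 162-game season, rounded down.
const minPlateAppearances = Math.floor(3.1 * 162); // 502

const top25BattingAverage = [
  // Keep only 2013 rows for players who meet the threshold.
  { $match: { Year: 2013, AtBats: { $gte: minPlateAppearances } } },
  // AtBats is at least 502 here, so the division is safe.
  { $project: { PlayerId: 1, Year: 1, TeamId: 1,
                AVG: { $divide: ["$Hits", "$AtBats"] } } },
  { $sort: { AVG: -1 } },
  { $limit: 25 }
];

// In the mongo shell: db.batting.aggregate(top25BattingAverage)
```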

Page 20: Mongo Baseball .NET

Top 25 Batting Average in 2013

In C# Using LINQ

Page 21: Mongo Baseball .NET

But…

Not truly aggregation pipeline in C#

Done on client, not server

Materialize on client with LINQ

Must use BsonDocument for aggregation pipeline

Yikes!

Page 22: Mongo Baseball .NET

Aggregation Pipeline in C#

Creating the $match BsonDocument

var match = new BsonDocument{
    {"$match", new BsonDocument{
        {"Year", 2013},
        {"AtBats", new BsonDocument{ {"$gte", 502} }}
    }}
};

Page 23: Mongo Baseball .NET

Aggregation Pipeline in C#

Create the $project operation

var project = new BsonDocument {
    {"$project", new BsonDocument{
        {"PlayerId", 1},
        {"Year", 1},
        {"TeamId", 1},
        {"AVG", new BsonDocument{
            {"$cond", new BsonDocument{
                // Compare against the number 0, not the string "0"
                {"if", new BsonDocument{ {"$eq", new BsonArray{"$AtBats", 0}} }},
                {"then", 0},
                {"else", new BsonDocument{ {"$divide", new BsonArray{"$Hits", "$AtBats"}} }}
            }}
        }}
    }}
};

Page 24: Mongo Baseball .NET

Aggregation Pipeline in C#

Create the $sort and $limit operations and then combine them all in an array

var sort = new BsonDocument{ {"$sort", new BsonDocument{ {"AVG", -1} } } };

var limit = new BsonDocument{ {"$limit", 25} };

return new[] { match, project, sort, limit };

Page 25: Mongo Baseball .NET

Aggregation Pipeline in C#

All the { } with BsonDocument and BsonArray remind me of…

Page 26: Mongo Baseball .NET

On Base Percentage (OBP)

A measure of how often a batter reaches base for any reason other than a fielding error, fielder's choice, dropped/uncaught third strike, fielder's obstruction, or catcher's interference.

- Wikipedia (http://en.wikipedia.org/wiki/On-base_percentage)

Usually a better measure of batter’s performance than straight average

(H + BB + HBP) / (AB + BB + HBP + SF)
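As a quick worked example of the formula, here is a plain JavaScript sketch. The `onBasePercentage` function and the season line are made up for illustration and do not describe a real player.

```javascript
// OBP = (H + BB + HBP) / (AB + BB + HBP + SF)
function onBasePercentage({ H, BB, HBP, AB, SF }) {
  const denominator = AB + BB + HBP + SF;
  // A player with no qualifying plate appearances gets 0 rather than NaN.
  return denominator === 0 ? 0 : (H + BB + HBP) / denominator;
}

// Hypothetical season line: 255 times on base in 630 opportunities.
const obp = onBasePercentage({ H: 180, BB: 70, HBP: 5, AB: 550, SF: 5 });
console.log(obp.toFixed(3)); // prints "0.405"
```

The same player's batting average would be 180 / 550, about .327, which shows how walks and hit-by-pitches lift OBP above AVG.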

Page 27: Mongo Baseball .NET

On Base Percentage (OBP)

(Hits + BB + HBP) / (AB + BB + HBP + SF)

db.batting.aggregate([
  {$match: { Year: 2013, AtBats: {$gte: 502} }},
  {$project: {
    PlayerId: 1, Year: 1, TeamId: 1,
    OBP: { $cond: {
      if: {$eq: ["$AtBats", 0] },
      then: 0,
      else: { $divide: [
        {$add: ["$Hits","$BaseOnBalls","$HitByPitch"]},
        {$add: ["$AtBats","$BaseOnBalls","$HitByPitch","$SacrificeFlies"]}
      ]}
    }}
  }},
  {$sort: {OBP: -1}},
  {$limit: 25}
])

Page 28: Mongo Baseball .NET

On Base Percentage (OBP)

$match

$project

$sort

$limit

Page 29: Mongo Baseball .NET

Runs Created (Player)

Early SABRmetric type of stat, invented by Bill James

With regard to an offensive player, the first key question is how many runs have resulted from what he has done with the bat and on the basepaths. Willie McCovey hit .270 in his career, with 353 doubles, 46 triples, 521 home runs and 1,345 walks -- but his job was not to hit doubles, nor to hit singles, nor to hit triples, nor to draw walks or even hit home runs, but rather to put runs on the scoreboard. How many runs resulted from all of these things?

- Bill James (James, Bill (1985). The Bill James Historical Baseball Abstract (1st ed.), pp. 273-4. Villard. ISBN 0-394-53713-0)

((H + BB) x TB) / (AB + BB)

Aggregated across a team, RC is usually within 5% of a team’s actual runs

Page 30: Mongo Baseball .NET

Runs Created (Player)

(Hits + Walks) * Total Bases / (At Bats + Walks)

db.batting.aggregate([
  {$match: {Year: 2013, AtBats: {$gte: 502}}},
  {$project: {
    PlayerId: 1, Year: 1, TeamId: 1,
    RC: { $divide: [
      {$multiply: [
        {$add: ["$Hits","$BaseOnBalls"]},
        {$add: ["$Hits","$Doubles","$Triples","$Triples","$HomeRuns","$HomeRuns","$HomeRuns"]}
      ]},
      {$add: ["$AtBats","$BaseOnBalls"]}
    ]}
  }},
  {$sort: {RC: -1}},
  {$limit: 25}
])
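The repeated `$Triples` and `$HomeRuns` entries in the pipeline are how total bases get computed with `$add` alone: every hit counts once, and each extra-base hit is added again for each additional base. A plain JavaScript sketch of the same arithmetic, using an invented stat line:

```javascript
// Total bases: each hit counts once, then doubles add 1 more base,
// triples add 2 more, and home runs add 3 more.
function totalBases({ Hits, Doubles, Triples, HomeRuns }) {
  return Hits + Doubles + 2 * Triples + 3 * HomeRuns;
}

// RC = ((H + BB) x TB) / (AB + BB)
function runsCreated(p) {
  return ((p.Hits + p.BaseOnBalls) * totalBases(p)) / (p.AtBats + p.BaseOnBalls);
}

// Hypothetical stat line, not a real player.
const sample = { Hits: 150, Doubles: 30, Triples: 5, HomeRuns: 20,
                 BaseOnBalls: 50, AtBats: 500 };

console.log(totalBases(sample));             // prints 250
console.log(runsCreated(sample).toFixed(1)); // prints "90.9"
```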

Page 31: Mongo Baseball .NET

Runs Created (Player)

$match

$project

$sort

$limit

Page 32: Mongo Baseball .NET

Runs Created (Team)

db.batting.aggregate([
  {$match: {Year: 2013}},
  {$group: {
    _id: "$TeamId",
    Hits: {$sum: "$Hits"},
    Walks: {$sum: "$BaseOnBalls"},
    Doubles: {$sum: "$Doubles"},
    Triples: {$sum: "$Triples"},
    HR: {$sum: "$HomeRuns"},
    AtBats: {$sum: "$AtBats"}
  }},
  {$project: {
    RC: { $divide: [
      {$multiply: [
        {$add: ["$Hits","$Walks"]},
        {$add: ["$Hits","$Doubles","$Triples","$Triples","$HR","$HR","$HR"]}
      ]},
      {$add: ["$AtBats","$Walks"]}
    ]}
  }},
  {$sort: {RC: -1}}
])

Page 33: Mongo Baseball .NET

Runs Created (Team)

$match

$group

$project

$sort

Page 34: Mongo Baseball .NET

Baseball Salaries Over Time

Babe Ruth was the highest paid player in the 20’s ($80K in ’30/’31)

Babe and Ty Cobb were highest paid in 1920 at $20K

Joe DiMaggio highest paid in 1950 ($100K)

Nolan Ryan made $1M in 1980 (1st time)

Albert Belle made $10M in 1997

In 1999, he made ~$12M (more than the entire Pirates payroll)

2001 – ARod made $22M

2009 – ARod made $33M

Page 35: Mongo Baseball .NET

Cost Per Base (CPB)

Hoerster copyrighted statistic

Compares the cost of each base produced by a hitter

Who are the most expensive players?

Page 36: Mongo Baseball .NET

Cost Per Base (CPB)

Takes total bases

Hits + Doubles + (Triples x 2) + (HR x 3) + SB + BB + HBP – CS

Divides salary by that total

Definitely not predictive

More of a value statistic
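A plain JavaScript sketch of the calculation as described above. The function names, the `Salary` field, and the sample numbers are assumptions for illustration; the other field names follow the batting pipelines in this deck.

```javascript
// Bases produced, per the slide:
// Hits + Doubles + (Triples x 2) + (HR x 3) + SB + BB + HBP - CS
function basesProduced(p) {
  return p.Hits + p.Doubles + 2 * p.Triples + 3 * p.HomeRuns
       + p.StolenBases + p.BaseOnBalls + p.HitByPitch - p.CaughtStealing;
}

// Cost per base: salary divided by bases produced.
// Returns null when a player produced no bases, to avoid dividing by zero.
function costPerBase(p) {
  const bases = basesProduced(p);
  return bases > 0 ? p.Salary / bases : null;
}

// Hypothetical player, not real data.
const player = { Hits: 150, Doubles: 30, Triples: 5, HomeRuns: 20,
                 StolenBases: 10, BaseOnBalls: 50, HitByPitch: 5,
                 CaughtStealing: 5, Salary: 3000000 };

console.log(basesProduced(player));           // prints 310
console.log(Math.round(costPerBase(player))); // prints 9677
```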

Page 37: Mongo Baseball .NET

Weighted On Base Average

Is a statistic, created by Tom Tango and based on linear regression, designed to measure a player's overall offensive contributions per plate appearance.

- Wikipedia (http://en.wikipedia.org/wiki/Weighted_on-base_average)

Weights each component of offense with a factor

((wBB*BB)+(wHBP*HBP)+(wH*Hits)+(w2B*2B)+(w3B*3B)+(wHR*HR)+(wSB*SB)+(wCS*CS)) / (AB+BB+HBP+SF-IBB)
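The formula can be read as a weighted sum divided by an adjusted plate-appearance count. The sketch below is plain JavaScript; the weights and stat line are invented placeholders (real weights vary by season and come from a lookup table), and the property names are chosen only for this example.

```javascript
// wOBA = sum of (weight x event count) / (AB + BB + HBP + SF - IBB)
function weightedOnBaseAverage(w, p) {
  const numerator =
      w.wBB * p.BB + w.wHBP * p.HBP + w.wH * p.Hits
    + w.w2B * p.Doubles + w.w3B * p.Triples + w.wHR * p.HR
    + w.wSB * p.SB + w.wCS * p.CS;
  const denominator = p.AB + p.BB + p.HBP + p.SF - p.IBB;
  return denominator === 0 ? 0 : numerator / denominator;
}

// Placeholder weights for illustration only; not the factors for any real season.
const weights = { wBB: 0.7, wHBP: 0.7, wH: 0.9, w2B: 0.3, w3B: 0.6,
                  wHR: 1.0, wSB: 0.2, wCS: -0.4 };

// Hypothetical stat line.
const line = { BB: 50, HBP: 5, Hits: 150, Doubles: 30, Triples: 5, HR: 20,
               SB: 10, CS: 5, AB: 500, SF: 5, IBB: 10 };

console.log(weightedOnBaseAverage(weights, line).toFixed(3));
```

Note that the caught-stealing weight is negative, so that term reduces the total.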

Page 38: Mongo Baseball .NET

Weighted On Base Average

var woba = db.WOBALookup.findOne({_id:2013});

db.batting.aggregate([
  {$match: {Year: woba._id}},
  {$redact: { $cond: {
    if: { $gte: ["$AtBats",502] },
    then: "$$KEEP",
    else: "$$PRUNE"
  }}},
  {$project: {
    Year: 1, PlayerId: 1, TeamId: 1,
    WOBA: { $divide: [
      {$add: [
        {$multiply:[woba.wBB,"$BaseOnBalls"]},
        {$multiply:[woba.wHBP,"$HitByPitch"]},
        {$multiply:[woba.w1B,"$Hits"]},
        {$multiply:[woba.w2B,"$Doubles"]},
        {$multiply:[woba.w3B,"$Triples"]},
        {$multiply:[woba.wHR,"$HomeRuns"]},
        {$multiply:[woba.runSB,"$StolenBases"]},
        {$multiply:[woba.runCS,"$CaughtStealing"]}
      ]},
      {$add: "]}]}
    ] }
  }},
  {$limit:25},
  {$sort: {WOBA:-1}},
  {$out: "2013TopWOBA"}
])

Page 39: Mongo Baseball .NET

Weighted On Base Average

$match

$redact

$project

$limit

$sort

$out

wOBA_Factors

2013TopWOBA

Page 40: Mongo Baseball .NET

Weighted Runs Above Average

Calculates, on average, how many more runs a player generates than the average player in the league

Uses wOBA as a primary factor in calculation

This then gets figured into the overall WAR of a player

Good description here: http://www.baseball-reference.com/about/war_explained_wraa.shtml
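Read as plain arithmetic, the `$project` expression used in this deck's wRAA pipeline is ((player wOBA - league wOBA) / wOBA scale) x plate appearances, with plate appearances approximated as AB + BB + HBP + SF + SH. A JavaScript sketch follows; the function name and all sample values are hypothetical.

```javascript
// wRAA = ((wOBA - league wOBA) / wOBA scale) x plate appearances
function weightedRunsAboveAverage(playerWoba, league, p) {
  // Plate appearances approximated the same way as in the pipeline.
  const plateAppearances = p.AtBats + p.BaseOnBalls + p.HitByPitch
                         + p.SacrificeFlies + p.SacrificeHits;
  return ((playerWoba - league.wOBA) / league.wOBAScale) * plateAppearances;
}

// Placeholder league factors; not the real values for any season.
const league = { wOBA: 0.320, wOBAScale: 1.25 };

// Hypothetical player line: 560 plate appearances.
const hitter = { AtBats: 500, BaseOnBalls: 50, HitByPitch: 5,
                 SacrificeFlies: 5, SacrificeHits: 0 };

console.log(weightedRunsAboveAverage(0.400, league, hitter).toFixed(2));
```

A player whose wOBA equals the league figure comes out at exactly 0, and a below-average hitter comes out negative.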

Page 41: Mongo Baseball .NET

Weighted Runs Above Average

var woba = db.WOBALookup.findOne({_id:2013});

db.TopWOBA2013.aggregate([
  {$match: {Year: woba._id}},
  {$project: {
    Year: 1, PlayerId: 1, TeamId: 1,
    wRAA: { $multiply: [
      {$divide: [{$subtract: ["$WOBA",woba.wOBA]}, woba.wOBAScale]},
      {$add: ["$AtBats","$BaseOnBalls","$HitByPitch","$SacrificeFlies","$SacrificeHits"]}
    ]}
  }},
  {$sort: { wRAA: -1 }},
  {$out: 'TopWRAA013'}
]);

Page 42: Mongo Baseball .NET

Weighted Runs Above Average

$match

$project

$sort

$out

wOBA_Factors

TopWRAA013

Page 43: Mongo Baseball .NET

Wrapping Up

Much of the aggregation pipeline in Mongo can be done with LINQ

But it will be client-side, not in Mongo!

Take advantage of $out for intermediary collections during processing

Stage your operations

Maybe intermediary collections can be reused for other calcs

$group ids can be multi-valued

The id ends up as a sub-document and must be referenced accordingly
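To illustrate the last point, here is a sketch of a pipeline that groups on two fields. The pipeline name is made up; the `batting` collection and field names follow the earlier slides.

```javascript
const hitsByTeamAndYear = [
  // A multi-valued _id: one group per (TeamId, Year) pair.
  { $group: {
      _id: { TeamId: "$TeamId", Year: "$Year" },
      Hits: { $sum: "$Hits" }
  } },
  // The group key is now a sub-document, so later stages
  // address its parts with dot notation: "$_id.TeamId".
  { $project: {
      _id: 0,
      TeamId: "$_id.TeamId",
      Year: "$_id.Year",
      Hits: 1
  } },
  { $sort: { Year: 1, TeamId: 1 } }
];

// In the mongo shell: db.batting.aggregate(hitsByTeamAndYear)
```

Referencing `"$TeamId"` after the `$group` stage would not work, because that field only exists inside `_id` from that point on.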

Page 44: Mongo Baseball .NET

Resources

Sean Lahman’s Baseball Database

http://seanlahman.com/baseball-archive/statistics/

Society for American Baseball Research

http://sabr.org/

wOBA Annual Factors

http://www.beyondtheboxscore.com/2011/1/4/1912914/custom-woba-and-linear-weights-through-2010-baseball-databank-data

Tom Tango’s Blog

http://espn.go.com/blog/statsinfo/tag/_/name/tom-tango

Annual Salary Leaders, 1874 – 2012

http://sabr.org/research/mlbs-annual-salary-leaders-1874-2012