Using the MongoDB Aggregation Pipeline, we'll look at how we can calculate several baseball statistics, including a few SABRmetric stats.
MongoBaseball.Net
David Hoerster
2014
About Me
C# MVP (since April 2011)
Sr. Director of Web Solutions at RGP
Conference Director for Pittsburgh TechFest
Co-Founder of BrainCredits (braincredits.com)
Past President of Pittsburgh .NET Users Group and organizer of recent Pittsburgh Code Camps and other Tech Events
Twitter - @DavidHoerster
Blog – http://geekswithblogs.net/DavidHoerster
Email – [email protected]
Best of Both Worlds
+
Assumptions
Basic understanding of document databases, like Mongo
Familiarity with querying (non-aggregate pipeline) in Mongo
General understanding of baseball
Baseball Statistics
Basics like AVG, OBP, and ERA have been around for a long time
An underground of advanced statistics has been growing since the early 70s
Bill James is probably the best known
Society for American Baseball Research (SABR)
Fosters the research of baseball statistical history
Stats like wOBA, wRAA, WAR, DIPS, NERD and more
Lends itself to computer modeling and big data
Basic Mongo
Document database
A “NoSQL” solution
Wide range of querying and manipulation capabilities
Querying Mongo Data
Issue a JSON document as the query
find and findOne are like the LINQ Select and First/Single methods
Basic cursor functionality (think DataReader)
Mongo C# Driver
Download as a NuGet package
Actively worked on and contributed to
There is an “official” client, along with several community clients
Mongo’s Aggregation Pipeline
MongoDB’s data aggregation solution
Modeled on the concept of data processing pipelines
Operations are performed in stages
Results from one stage are “piped” to the next stage
Example stages: $match -> $project -> $sort
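To make the "piped" idea concrete, here is a minimal sketch in plain JavaScript that mimics three stages on an in-memory array. It is only an analogy: the sample records are made up, the field names are borrowed from the later slides, and nothing here talks to MongoDB.

```javascript
// Hypothetical in-memory sample; field names follow the later slides.
const batting = [
  { PlayerId: "a", Year: 2013, Hits: 150, AtBats: 500 },
  { PlayerId: "b", Year: 2013, Hits: 186, AtBats: 600 },
  { PlayerId: "c", Year: 2012, Hits: 200, AtBats: 600 },
];

// Each step consumes the output of the previous one, like pipeline stages.
const result = batting
  .filter(d => d.Year === 2013)                                  // roughly $match
  .map(d => ({ PlayerId: d.PlayerId, AVG: d.Hits / d.AtBats }))  // roughly $project
  .sort((x, y) => y.AVG - x.AVG);                                // roughly $sort

console.log(result.map(d => d.PlayerId));
```

The difference in Mongo is that every stage runs on the server, so only the final result crosses the wire.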
Aggregation Pipeline
A number of operations are available
$group, $match, $project, $sort, $limit, $skip, $redact, $out, …
Essentially replaces the older mapReduce functionality
The Aggregation Pipeline generally performs better; mapReduce is more flexible
Aggregation combines a number of operations in order to produce a result set
Aggregation Pipeline
Maximum size of a returned document is 16 MB
Aggregation Pipeline now returns results using a cursor (as of MongoDB 2.6)
Each stage of a pipeline has a maximum limit of 100 MB of RAM
Enable allowDiskUse in order to write to disk and avoid this limitation
MongoDB will also optimize the pipeline, if possible
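As a rough sketch of how allowDiskUse is switched on, the option goes in a second argument to aggregate(). The pipeline below is only an illustration (a full-season sort is the kind of stage that can run into the memory limit); the batting collection and its fields are taken from the later slides.

```javascript
// Illustrative pipeline: sorting a whole season of batting records.
const pipeline = [
  { $match: { Year: 2013 } },
  { $sort: { Hits: -1 } },
];

// Options document for aggregate(); allowDiskUse lets stages spill to disk.
const options = { allowDiskUse: true };

// In the mongo shell (not executed here):
//   db.batting.aggregate(pipeline, options)
console.log(JSON.stringify(options));
```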
Simple Single-Purpose Op
Count
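The code for this slide did not survive extraction. One common way to count documents with the pipeline, offered here as an assumption rather than as what the slide showed, is a single $group stage that puts everything in one bucket and adds 1 per document.

```javascript
// Group every document into a single bucket (_id: null) and add 1 for each.
const countPipeline = [
  { $group: { _id: null, count: { $sum: 1 } } },
];

// In the mongo shell (not executed here):
//   db.batting.aggregate(countPipeline)
console.log(JSON.stringify(countPipeline));
```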
Simple Projection
Batting Average (Hits / At Bats)
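The shell code for this slide is missing from the extracted text. A sketch that mirrors the C# $project built later in the deck might look like the following. The battingAverage helper is a hypothetical plain-JavaScript mirror of the same expression, added only for spot checks.

```javascript
// Sketch of a batting-average projection; the $cond guards against dividing by zero.
const avgPipeline = [
  { $project: {
      PlayerId: 1, Year: 1, TeamId: 1,
      AVG: { $cond: {
        if: { $eq: ["$AtBats", 0] },
        then: 0,
        else: { $divide: ["$Hits", "$AtBats"] },
      } },
  } },
];

// Hypothetical helper: the same calculation in plain JavaScript.
const battingAverage = (hits, atBats) => (atBats === 0 ? 0 : hits / atBats);

// In the mongo shell (not executed here):
//   db.batting.aggregate(avgPipeline)
console.log(battingAverage(150, 500));
```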
Simple Projection Batting Average in C#
LINQ to Mongo
Part of Mongo C# Driver
Implements find and findOne
Other grouping and projecting done client-side
Do you want all that data before manipulating it?
Top 25 Batting Average in 2013
Add a $match pipeline operation
Now need to sort
But wait… we have incorrect results for top Batting Average
Need to enhance $match to include only qualifying players: 3.1 plate appearances (PA) per game over 162 games, or 502 PA
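Put together, the steps above could be sketched as one pipeline. This follows the deck's own C# version in filtering on AtBats rather than true plate appearances; the QUALIFYING_MINIMUM name is made up for this example.

```javascript
// 3.1 plate appearances per game over a 162-game season, rounded down.
const QUALIFYING_MINIMUM = Math.floor(3.1 * 162); // 502

const top25Avg = [
  // Keep only 2013 rows for players with enough at-bats to qualify.
  { $match: { Year: 2013, AtBats: { $gte: QUALIFYING_MINIMUM } } },
  // Compute the average, guarding against a zero denominator.
  { $project: {
      PlayerId: 1, Year: 1, TeamId: 1,
      AVG: { $cond: {
        if: { $eq: ["$AtBats", 0] },
        then: 0,
        else: { $divide: ["$Hits", "$AtBats"] },
      } },
  } },
  // Highest average first, then keep the top 25.
  { $sort: { AVG: -1 } },
  { $limit: 25 },
];

// In the mongo shell (not executed here):
//   db.batting.aggregate(top25Avg)
console.log(top25Avg.map(stage => Object.keys(stage)[0]).join(" -> "));
```

Sorting before limiting matters: with the order reversed, Mongo would keep 25 arbitrary players and only then rank them.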
Top 25 Batting Average in 2013, in C# Using LINQ
But…
Not truly the aggregation pipeline in C#
Done on client, not server
Materialize on client with LINQ
Must use BsonDocument for aggregation pipeline
Yikes!
Aggregation Pipeline in C#
Creating the $match BsonDocument
var match = new BsonDocument
{
    { "$match", new BsonDocument
        {
            { "Year", 2013 },
            { "AtBats", new BsonDocument { { "$gte", 502 } } }
        }
    }
};
Aggregation Pipeline in C#
Create the $project operation
var project = new BsonDocument
{
    { "$project", new BsonDocument
        {
            { "PlayerId", 1 },
            { "Year", 1 },
            { "TeamId", 1 },
            { "AVG", new BsonDocument
                {
                    { "$cond", new BsonDocument
                        {
                            // Compare against the number 0, not the string "0"
                            { "if", new BsonDocument { { "$eq", new BsonArray { "$AtBats", 0 } } } },
                            { "then", 0 },
                            { "else", new BsonDocument { { "$divide", new BsonArray { "$Hits", "$AtBats" } } } }
                        }
                    }
                }
            }
        }
    }
};
Aggregation Pipeline in C#
Create the $sort and $limit operations, then combine them all in an array
var sort = new BsonDocument { { "$sort", new BsonDocument { { "AVG", -1 } } } };
var limit = new BsonDocument { { "$limit", 25 } };
return new[] { match, project, sort, limit };
Aggregation Pipeline in C#
All the { } with BsonDocument and BsonArray reminds me of…
On Base Percentage (OBP)
A measure of how often a batter reaches base for any reason other than a fielding error, fielder's choice, dropped/uncaught third strike, fielder's obstruction, or catcher's interference.
- Wikipedia (http://en.wikipedia.org/wiki/On-base_percentage)
Usually a better measure of batter’s performance than straight average
(H + BB + HBP) / (AB + BB + HBP + SF)
On Base Percentage (OBP)
(Hits + BB + HBP) / (AB + BB + HBP + SF)
db.batting.aggregate([
  {$match: {Year: 2013, AtBats: {$gte: 502}}},
  {$project: {
    PlayerId: 1, Year: 1, TeamId: 1,
    OBP: {
      $cond: {
        if: {$eq: ["$AtBats", 0]},
        then: 0,
        else: {$divide: [
          {$add: ["$Hits", "$BaseOnBalls", "$HitByPitch"]},
          {$add: ["$AtBats", "$BaseOnBalls", "$HitByPitch", "$SacrificeFlies"]}
        ]}
      }
    }
  }},
  {$sort: {OBP: -1}},
  {$limit: 25}
])
On Base Percentage (OBP)
Pipeline stages: $match -> $project -> $sort -> $limit
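For a quick sanity check of the OBP formula outside Mongo, a small plain-JavaScript version can help. The sample stat line is invented, and this sketch guards on the whole denominator rather than on AtBats alone.

```javascript
// (H + BB + HBP) / (AB + BB + HBP + SF), using the deck's field names.
function obp({ Hits, BaseOnBalls, HitByPitch, AtBats, SacrificeFlies }) {
  const denominator = AtBats + BaseOnBalls + HitByPitch + SacrificeFlies;
  return denominator === 0 ? 0 : (Hits + BaseOnBalls + HitByPitch) / denominator;
}

// Hypothetical season line, not real player data.
const sample = { Hits: 180, BaseOnBalls: 60, HitByPitch: 5, AtBats: 550, SacrificeFlies: 5 };
console.log(obp(sample).toFixed(3)); // 245 / 620
```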
Runs Created (Player)
Early SABRmetric type of stat, invented by Bill James
With regard to an offensive player, the first key question is how many runs have resulted from what he has done with the bat and on the basepaths. Willie McCovey hit .270 in his career, with 353 doubles, 46 triples, 521 home runs and 1,345 walks -- but his job was not to hit doubles, nor to hit singles, nor to hit triples, nor to draw walks or even hit home runs, but rather to put runs on the scoreboard. How many runs resulted from all of these things?
- Bill James (James, Bill (1985). The Bill James Historical Baseball Abstract (1st ed.), pp. 273-4. Villard. ISBN 0-394-53713-0)
((H + BB) x TB) / (AB + BB)
Aggregated across a team, RC is usually within 5% of a team’s actual runs
Runs Created (Player)
(Hits + Walks) * Total Bases / (At Bats + Walks)
db.batting.aggregate([
  {$match: {Year: 2013, AtBats: {$gte: 502}}},
  {$project: {
    PlayerId: 1, Year: 1, TeamId: 1,
    RC: {
      $divide: [
        {$multiply: [
          {$add: ["$Hits", "$BaseOnBalls"]},
          {$add: ["$Hits", "$Doubles", "$Triples", "$Triples", "$HomeRuns", "$HomeRuns", "$HomeRuns"]}
        ]},
        {$add: ["$AtBats", "$BaseOnBalls"]}
      ]
    }
  }},
  {$sort: {RC: -1}},
  {$limit: 25}
])
Runs Created (Player)
Pipeline stages: $match -> $project -> $sort -> $limit
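The total-bases trick in the pipeline (adding Triples twice and HomeRuns three times on top of Hits and Doubles) can be checked with a short plain-JavaScript sketch. The stat line below is invented for illustration.

```javascript
// Hits already counts every hit once, so extra-base hits only add their extra bases:
// +1 per double, +2 per triple, +3 per home run.
function totalBases({ Hits, Doubles, Triples, HomeRuns }) {
  return Hits + Doubles + 2 * Triples + 3 * HomeRuns;
}

// ((H + BB) x TB) / (AB + BB)
function runsCreated(s) {
  return ((s.Hits + s.BaseOnBalls) * totalBases(s)) / (s.AtBats + s.BaseOnBalls);
}

// Hypothetical season line, not real player data.
const line = { Hits: 160, Doubles: 30, Triples: 5, HomeRuns: 25, BaseOnBalls: 60, AtBats: 540 };
console.log(totalBases(line), runsCreated(line).toFixed(1));
```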
Runs Created (Team)
db.batting.aggregate([
  {$match: {Year: 2013}},
  {$group: {
    _id: "$TeamId",
    Hits: {$sum: "$Hits"},
    Walks: {$sum: "$BaseOnBalls"},
    Doubles: {$sum: "$Doubles"},
    Triples: {$sum: "$Triples"},
    HR: {$sum: "$HomeRuns"},
    AtBats: {$sum: "$AtBats"}
  }},
  {$project: {
    RC: {
      $divide: [
        {$multiply: [
          {$add: ["$Hits", "$Walks"]},
          {$add: ["$Hits", "$Doubles", "$Triples", "$Triples", "$HR", "$HR", "$HR"]}
        ]},
        {$add: ["$AtBats", "$Walks"]}
      ]
    }
  }},
  {$sort: {RC: -1}}
])
Runs Created (Team)
Pipeline stages: $match -> $group -> $project -> $sort
Baseball Salaries Over Time
Babe Ruth was the highest paid player in the 20’s ($80K in ’30/’31)
Babe and Ty Cobb were the highest paid in 1920 at $20K
Joe DiMaggio was the highest paid in 1950 ($100K)
Nolan Ryan made $1M in 1980 (1st time)
Albert Belle made $10M in 1997
In 1999, he made ~$12M (more than the entire Pirates payroll)
2001 – ARod made $22M
2009 – ARod made $33M
Cost Per Base (CPB)
Hoerster copyrighted statistic
Compares the cost of each base produced by a hitter
Who are the most expensive players?
Cost Per Base (CPB)
Takes total bases
Hits + Doubles + (Triples x 2) + (HR x 3) + SB + BB + HBP – CS
Divides salary by that total
Definitely not predictive
More of a value statistic
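The deck gives no code for CPB, so the following is only a sketch of the arithmetic in plain JavaScript. The Salary, StolenBases, and CaughtStealing field names and the sample numbers are assumptions made for this example.

```javascript
// Bases produced: Hits + Doubles + (Triples x 2) + (HR x 3) + SB + BB + HBP - CS
function basesProduced(s) {
  return s.Hits + s.Doubles + 2 * s.Triples + 3 * s.HomeRuns
       + s.StolenBases + s.BaseOnBalls + s.HitByPitch - s.CaughtStealing;
}

// Salary divided by bases produced; null when no bases were produced.
function costPerBase(s) {
  const bases = basesProduced(s);
  return bases > 0 ? s.Salary / bases : null;
}

// Hypothetical player-season, not real data.
const season = {
  Salary: 10000000, Hits: 150, Doubles: 30, Triples: 2, HomeRuns: 20,
  StolenBases: 10, BaseOnBalls: 50, HitByPitch: 4, CaughtStealing: 4,
};
console.log(basesProduced(season), Math.round(costPerBase(season)));
```

A higher number means each base cost the team more, which is why this reads as a value statistic and not a predictive one.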
Weighted On Base Average
Is a statistic, created by Tom Tango and based on linear regression, designed to measure a player's overall offensive contributions per plate appearance.
- Wikipedia (http://en.wikipedia.org/wiki/Weighted_on-base_average)
Weights each component of offense by a factor
((wBB*BB)+(wHBP*HBP)+(wH*Hits)+(w2B*2B)+(w3B*3B)+(wHR*HR)+(wSB*SB)+(wCS*CS)) / (AB+BB+HBP+SF-IBB)
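Weighted On Base Average
var woba = db.WOBALookup.findOne({_id: 2013});
db.batting.aggregate([
  {$match: {Year: woba._id}},
  {$redact: {
    $cond: {
      if: {$gte: ["$AtBats", 502]},
      then: "$$KEEP",
      else: "$$PRUNE"
    }
  }},
  {$project: {
    Year: 1, PlayerId: 1, TeamId: 1,
    // Counting stats carried forward so a later pipeline can compute plate appearances
    AtBats: 1, BaseOnBalls: 1, HitByPitch: 1, SacrificeFlies: 1, SacrificeHits: 1,
    WOBA: {
      $divide: [
        {$add: [
          {$multiply: [woba.wBB, "$BaseOnBalls"]},
          {$multiply: [woba.wHBP, "$HitByPitch"]},
          {$multiply: [woba.w1B, "$Hits"]},
          {$multiply: [woba.w2B, "$Doubles"]},
          {$multiply: [woba.w3B, "$Triples"]},
          {$multiply: [woba.wHR, "$HomeRuns"]},
          {$multiply: [woba.runSB, "$StolenBases"]},
          {$multiply: [woba.runCS, "$CaughtStealing"]}
        ]},
        // The denominator was lost in extraction; rebuilt here from the formula (AB+BB+HBP+SF-IBB).
        // The IBB term is left out because its field name does not appear in the source.
        {$add: ["$AtBats", "$BaseOnBalls", "$HitByPitch", "$SacrificeFlies"]}
      ]
    }
  }},
  // Sort before limiting so the top 25 are the ones kept
  {$sort: {WOBA: -1}},
  {$limit: 25},
  {$out: "TopWOBA2013"}
])
Weighted On Base Average
Pipeline stages: $match -> $redact -> $project -> $sort -> $limit -> $out (factors from wOBA_Factors, output to TopWOBA2013)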
Weighted Runs Above Average
Calculates, on average, how many more runs a player generates than the average player in the league
Uses wOBA as a primary factor in the calculation
This then gets figured into a player's overall WAR
Good description here: http://www.baseball-reference.com/about/war_explained_wraa.shtml
Weighted Runs Above Average
var woba = db.WOBALookup.findOne({_id: 2013});
db.TopWOBA2013.aggregate([
  {$match: {Year: woba._id}},
  {$project: {
    Year: 1, PlayerId: 1, TeamId: 1,
    wRAA: {
      $multiply: [
        {$divide: [{$subtract: ["$WOBA", woba.wOBA]}, woba.wOBAScale]},
        {$add: ["$AtBats", "$BaseOnBalls", "$HitByPitch", "$SacrificeFlies", "$SacrificeHits"]}
      ]
    }
  }},
  {$sort: {wRAA: -1}},
  {$out: 'TopWRAA013'}
]);
Weighted Runs Above Average
Pipeline stages: $match -> $project -> $sort -> $out (factors from wOBA_Factors, output to 'TopWRAA013')
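The $project expression above boils down to ((wOBA - league wOBA) / wOBA scale) x plate appearances. A plain-JavaScript sketch of that arithmetic follows. The function name and all four input values are hypothetical and are not taken from the lookup collection.

```javascript
// ((player wOBA - league wOBA) / wOBA scale) * plate appearances
function weightedRunsAboveAverage(playerWoba, leagueWoba, wobaScale, plateAppearances) {
  return ((playerWoba - leagueWoba) / wobaScale) * plateAppearances;
}

// Hypothetical inputs chosen for easy arithmetic, not real 2013 factors.
const wraa = weightedRunsAboveAverage(0.400, 0.320, 1.25, 600);
console.log(wraa.toFixed(1));
```

A player whose wOBA equals the league figure comes out at zero, which is the "average player" baseline the stat is named for.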
Wrapping Up
Much of the aggregation pipeline in Mongo can be done with LINQ
But it will be client-side, not in Mongo!
Take advantage of $out for intermediary tables during processing
Stage your operations
Maybe intermediary tables can be reused for other calcs
$group ids can be multi-valued
Ends up as a sub-document and must be referenced accordingly
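The last point can be sketched as follows. With a two-part _id, later stages reach the parts through dotted paths such as "$_id.TeamId". The pipeline is an illustration built on the deck's field names, not something shown on the slides.

```javascript
// Group by team and year together; the _id becomes a sub-document.
const teamSeasonHits = [
  { $group: {
      _id: { TeamId: "$TeamId", Year: "$Year" },
      Hits: { $sum: "$Hits" },
  } },
  // Pull the parts of the compound _id back out with dotted paths.
  { $project: { _id: 0, TeamId: "$_id.TeamId", Year: "$_id.Year", Hits: 1 } },
  { $sort: { Year: 1, TeamId: 1 } },
];

// In the mongo shell (not executed here):
//   db.batting.aggregate(teamSeasonHits)
console.log(JSON.stringify(teamSeasonHits[1]));
```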
Resources
Sean Lahman’s Baseball Database: http://seanlahman.com/baseball-archive/statistics/
Society for American Baseball Research: http://sabr.org/
wOBA Annual Factors: http://www.beyondtheboxscore.com/2011/1/4/1912914/custom-woba-and-linear-weights-through-2010-baseball-databank-data
Tom Tango’s Blog: http://espn.go.com/blog/statsinfo/tag/_/name/tom-tango
Annual Salary Leaders, 1874 – 2012: http://sabr.org/research/mlbs-annual-salary-leaders-1874-2012