SOCIAL MEDIA ANALYTICS TO QUANTIFY FAN ENGAGEMENT DR. ROBERT BAKER TED KWARTLER Get a more complete profile of your fans to inform business decisions and improve ROI calculations.

Quantifying Fan Engagement using Social Media

DESCRIPTION

Presented at the September 2014 Sports Analytics Innovation Enterprise conference in San Francisco, the presentation covers fan engagement using text mining. Three teams were used to demonstrate basic text mining concepts applied to fan engagement. The analysis was performed on the Cleveland Indians, Los Angeles Dodgers and New York Yankees using the R statistics software.

Page 1: Quantifying Fan Engagement using Social Media

SOCIAL MEDIA ANALYTICS TO QUANTIFY FAN ENGAGEMENT

DR. ROBERT BAKER

TED KWARTLER

Get a more complete profile of your fans to inform business decisions and improve ROI calculations.

Page 2: Quantifying Fan Engagement using Social Media

Basics

Where are the fans?

Who are the fans?

What are fans talking about?

How do the fans feel towards the team?

What is the point of all this?

AGENDA

Page 3: Quantifying Fan Engagement using Social Media

If only there had been social media, the Yankees could have profiled my experience.

A FAN’S EXPERIENCE

Page 4: Quantifying Fan Engagement using Social Media

BASICS

WHAT IS TEXT MINING?

Page 5: Quantifying Fan Engagement using Social Media

Before text mining. After text mining.

SOCIAL MEDIA ANALYTICS REQUIRES TEXT MINING

Text mining lets you “drink from a fire hose” of information and distill useful meaning.

Page 6: Quantifying Fan Engagement using Social Media

Organized into Document Term Matrix (DTM) / Term Document Matrix (TDM)

Apply standard and domain specific rules

Unstructured natural language texts

WHAT IS TEXT MINING?

Insight & Recommendation

Text mining is an emerging technology that can be used to augment existing data by making unstructured text available for analysis and decision making.

Natural language texts: surveys, tweets, articles, emails, blogs, reviews

Page 7: Quantifying Fan Engagement using Social Media

Many sources including emails, forum posts, tweets, books, pdfs, reviews, transcripts etc.

EXAMPLE UNSTRUCTURED TEXT SOURCES

Unstructured natural language texts

杜兰特和詹姆斯谁才是当今联盟的头牌?这是最近很火热的话题。一方面杜兰特高居得分榜首位,在MVP权力榜上也雄踞第一;另一方面詹姆斯带领热火一切为了三连冠,比赛沉稳 ...

Had my first experience at TD Garden when my Bulls came to play the Celtics. Being someone with an out of state license living in Boston, I usually carry my passport anyway, but I had a friend in town and wanted to clear up this ID controversy I read so much about in the rules. 

Page 8: Quantifying Fan Engagement using Social Media

EXAMPLE PRE-PROCESSING STEPS

(in R or other software, e.g. Python NLTK)

1. Make all text lower case
2. For Twitter, remove “RT” for retweet
3. Remove symbols like “@”
4. Remove punctuation
5. Remove numbers
6. Remove URLs, e.g. http://www.espn.com
7. Remove extra whitespace
8. Remove “stopwords”
9. Others as needed depending on objective (e.g. stemming)

In a “bag of words” text mining methodology the corpus must be cleaned. Cleaning often means making items lower case and removing punctuation, numbers and extra whitespace. In unique instances, domain specific rules are applied (e.g. removing “RT” for retweet).
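
The cleaning steps listed on this slide can be sketched in a few lines. The deck's analysis was done in R; this is a minimal Python stand-in using only the standard library. The stopword list and the sample tweet are hypothetical, and the step order is one reasonable choice (URLs are removed before punctuation so they can still be recognized).

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def clean_text(text, stopwords=STOPWORDS):
    """Apply the slide's cleaning steps to one document and return a string."""
    text = text.lower()                              # 1. lower case
    text = re.sub(r"\brt\b", " ", text)              # 2. drop the "rt" retweet marker
    text = re.sub(r"https?://\S+", " ", text)        # 6. drop URLs while they are intact
    text = text.replace("@", " ")                    # 3. drop symbols like "@"
    text = text.translate(str.maketrans("", "", string.punctuation))  # 4. punctuation
    text = re.sub(r"\d+", " ", text)                 # 5. numbers
    # 7-8. split() collapses extra whitespace; then filter stopwords
    tokens = [t for t in text.split() if t not in stopwords]
    return " ".join(tokens)

# Hypothetical raw tweet, for illustration only.
raw = "RT @Yankees: No doubt, Derek Jeter is #2 of the greats! http://www.espn.com"
print(clean_text(raw))
```

Stemming (step 9) is left out here because it depends on the objective and usually needs a dedicated library.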

Apply standard and domain specific rules

Cleaned Version: no doubt derek jeter makes my top all time with babe lou yankee clipper mick

Translated Version: Durant and James, who is the league's first card today? This is a very hot topic recently. On the one hand Durant highest scoring top position in the standings MVP authority also ranked first; on the other hand, James led the Heat everything for three consecutive years, the race calm ...

Cleaned Version: durant james who league first card today very hot topic recently on one hand durant highest scoring top position standings mvp authority ranked first other hand james led heat everything three consecutive years race calm ...

杜兰特和詹姆斯谁才是当今联盟的头牌?这是最近很火热的话题。一方面杜兰特高居得分榜首位,在MVP权力榜上也雄踞第一;另一方面詹姆斯带领热火一切为了三连冠,比赛沉稳 ...

Page 9: Quantifying Fan Engagement using Social Media

Once cleaned the documents and terms are organized into large matrices.

Often they are very sparse and may contain tens of thousands of data points.

Attributes may be single words or word tokens of 2 or more words.

Organized into Document Term Matrix / Term Document Matrix

DATA ORGANIZATION

Tweet_1: no doubt derek jeter makes my top all time with babe lou yankee clipper mick

Sina_1: durant james who league first card today very hot topic recently on one hand durant highest scoring top position standings mvp authority ranked first other hand james led heat everything three consecutive years race calm ...

Document Term Matrix

Document  no  doubt  derek  jeter  top  durant  james  termN
Tweet_1   1   1      1      1      1    0       0      0
Sina_1    0   0      0      0      1    2       2      1
docN      …   …      …      …      …    …       …      …

Term Document Matrix

Term   Tweet_1  Sina_1  docN
no     1        0       …
doubt  1        0       …
jeter  1        0       …
top    1        1       …
termN  0        1       …
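
To make the two layouts concrete, here is a small sketch that counts terms per document and then transposes the result. It is plain Python rather than the R tooling used for the talk; the function names are illustrative, and the two documents are the cleaned examples from this slide (with the trailing "..." dropped).

```python
from collections import Counter

docs = {
    "Tweet_1": "no doubt derek jeter makes my top all time with babe lou yankee clipper mick",
    "Sina_1": "durant james who league first card today very hot topic recently on one hand "
              "durant highest scoring top position standings mvp authority ranked first "
              "other hand james led heat everything three consecutive years race calm",
}

def document_term_matrix(docs):
    """Rows are documents, columns are terms (sorted vocabulary)."""
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    dtm = {name: [c[term] for term in vocab] for name, c in counts.items()}
    return vocab, dtm

def term_document_matrix(vocab, dtm):
    """Transpose: rows are terms, columns are documents."""
    names = list(dtm)
    return {term: [dtm[n][i] for n in names] for i, term in enumerate(vocab)}

vocab, dtm = document_term_matrix(docs)
tdm = term_document_matrix(vocab, dtm)
print(tdm["top"], tdm["durant"], tdm["jeter"])  # counts in [Tweet_1, Sina_1] order
```

Most cells are zero even in this tiny example, which is the sparsity the slide mentions.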

Page 10: Quantifying Fan Engagement using Social Media

WHERE ARE THE FANS?

LOCATION BASED ATTRIBUTES

Page 11: Quantifying Fan Engagement using Social Media

DODGERS TWITTER FOLLOWERS - 10K SAMPLE

Page 12: Quantifying Fan Engagement using Social Media

INDIANS TWITTER FOLLOWERS - 10K SAMPLE

Page 13: Quantifying Fan Engagement using Social Media

NYY TWITTER FOLLOWERS - 10K SAMPLE

Page 14: Quantifying Fan Engagement using Social Media

Team     Total Followers  Sample     Bing API Geo-Located  Median Distance to Stadium
Dodgers  ~540K            First 10K  2,854                 1,372 miles
Indians  ~225K            First 10K  3,774                 319 miles
Yankees  ~1.18M           First 10K  1,335                 713 miles
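
A median distance like the one in the table can be computed once followers have latitude and longitude. The sketch below uses the haversine great-circle formula; the deck does not say which distance formula was used, so that choice is an assumption, and the coordinates are approximate, hypothetical values for illustration.

```python
import math
from statistics import median

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Approximate stadium location and hypothetical geo-located followers.
stadium = (34.07, -118.24)
followers = [(34.06, -117.75), (35.37, -119.02), (40.71, -74.01)]

distances = [haversine_miles(lat, lon, *stadium) for lat, lon in followers]
print(round(median(distances)), "miles (median)")
```

The median is used rather than the mean so that a handful of very distant followers does not dominate the summary.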

Page 15: Quantifying Fan Engagement using Social Media

WHO ARE THE FANS?

COMMON DEMOGRAPHIC EXTRACTION

Page 16: Quantifying Fan Engagement using Social Media

Sample of 3262 of 10k Followers Geo-located IDs

Zip    City                 Population  Avg house value  Income below poverty  Total businesses  Total households
91766  Pomona, CA           71,599      $142,800         15.4%                 803
93301  Bakersfield, CA      12,248      $109,600         20.4%                 1,438
91606  North Hollywood, CA  44,958      $170,100         15.4%                 622               14,903

From Twitter locations to zip code then demographic data.
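
The chain from a Twitter location string to a zip code to demographic attributes is essentially two lookups. The sketch below shows that shape with hard-coded dictionaries; in practice the first lookup would be a geocoding service and the second a census extract. Both dictionaries and the function name are hypothetical, with values copied from the table on this slide.

```python
# Hypothetical lookup tables standing in for a geocoder and a census extract.
LOCATION_TO_ZIP = {
    "pomona, ca": "91766",
    "north hollywood, ca": "91606",
}
ZIP_DEMOGRAPHICS = {
    "91766": {"city": "Pomona, CA", "population": 71599, "avg_house_value": 142800},
    "91606": {"city": "North Hollywood, CA", "population": 44958, "avg_house_value": 170100},
}

def profile_follower(location):
    """Return zip-level demographics for a free-text location, or None if not locatable."""
    zip_code = LOCATION_TO_ZIP.get(location.strip().lower())
    if zip_code is None:
        return None
    return {"zip": zip_code, **ZIP_DEMOGRAPHICS[zip_code]}

print(profile_follower("  Pomona, CA "))
print(profile_follower("somewhere on earth"))
```

Followers whose location cannot be resolved return None, which mirrors why only part of each 10K sample ends up geo-located.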

WE CAN GET MORE GRANULAR.

Page 17: Quantifying Fan Engagement using Social Media

Sample of 3775 of 10k Followers Geo-located IDs

Zip    City          Population  Avg house value  Income below poverty  Total businesses  Total households
44107  Lakewood, OH  52,244      $117,900         16.4%                 945               25,333
44139  Solon, OH     24,356      $215,700         16.4%                 1,155             8,693
44304  Akron, OH     5,916       $56,300          13.0%                 172               1,637

WE CAN GET MORE GRANULAR.

From Twitter locations to zip code then demographic data.

Page 18: Quantifying Fan Engagement using Social Media

Sample of 1335 of 10k Followers Geo-located IDs

Zip    City         Population  Avg house value  Income below poverty  Total businesses  Total households
10462  Bronx, NY    75,784      $192,600         27.9%                 1,002             29,855
14223  Buffalo, NY  22,665      $85,700          13.9%                 328               9,832
75060  Irving, TX   45,980      $83,300          17.2%                 503

WE CAN GET MORE GRANULAR.

From Twitter locations to zip code then demographic data.

Page 19: Quantifying Fan Engagement using Social Media

FURTHER INSIGHTS OF ZIP 91766, POMONA CA

At the zip code and metropolitan area there are countless dimensions that may aid in fan segmentation and marketing.

• Ranked #1 Drought Riskiest Cities
• Ranked #15 Riskiest for Identity Theft
• Ranked #5 Most Irritation Prone City

Sources:
http://www.census.gov
http://emergency.cdc.gov/snaps/data/39/39153.htm
http://www.bestplaces.net/rankings/zip-code/ohio/akron/44304

• Ranked #8 Healthiest
• Ranked #13 Best City for Teleworking
• Ranked #6 Most Single City

[Charts: Population (White, Black, Hispanic, Asian, Hawaiian, Indian, Other); Gender (male, female); Households (total.households, house w/child); Immigration (Mexico, El Salvador, Philippines, Guatemala, Korea, China, Vietnam, Iran)]

Page 20: Quantifying Fan Engagement using Social Media

FURTHER INSIGHTS OF ZIP 44304, AKRON OH

[Charts: Population (White, Black, Asian, Hawaiian, Indian, Other); Gender (male, female); Households (total.households, house w/child); Immigration (India, Germany, Yugoslavia, UK, Italy, Canada, China, other)]

• Ranked #1 Best City for Thanksgiving
• Ranked #4 Best Cities for Teleworking
• Ranked #25 America’s Best Cities for Dating

Sources:
http://www.census.gov
http://emergency.cdc.gov/snaps/data/39/39153.htm
http://www.bestplaces.net/rankings/zip-code/ohio/akron/44304

• Ranked #64 Most Popular City for the Holidays
• Ranked #73 America’s Most Stressful Cities
• Ranked #140 2005 Best Places to Live

At the zip code and metropolitan area there are countless dimensions that may aid in fan segmentation and marketing.

Page 21: Quantifying Fan Engagement using Social Media

FURTHER INSIGHTS OF ZIP 10462, BRONX NY

• Ranked #2 Least Crime for Large Metro Area
• Ranked #2 Sleepless Cities 2011
• Ranked #3 Most Single Cities

Sources:
http://www.census.gov
http://emergency.cdc.gov/snaps/data/39/39153.htm
http://www.bestplaces.net/rankings/zip-code/ohio/akron/44304

• Ranked #9 Most Irritation Prone Cities
• Ranked #14 Healthiest Cities
• Ranked #28 Most Playful Cities

[Charts: Population (White, Black, Hispanic, Asian, Hawaiian, Indian, Other); Gender (male, female); Households (total.households, house w/child); Immigration (Dominican, Jamaica, Mexico, Guyana, Ecuador, Caribbean, Honduras, Ghana)]

At the zip code and metropolitan area there are countless dimensions that may aid in fan segmentation and marketing.

Page 22: Quantifying Fan Engagement using Social Media

WHAT ARE THE FANS TALKING ABOUT?

INTERESTING TOPICS AND NAMED ENTITY RECOGNITION

Page 23: Quantifying Fan Engagement using Social Media

• Free Twitter API

• Tweets mentioning “Indians”

• 7/31 & 8/1

• “Tokenize” single words into unique two word groups

• Trade mentions
  • Masterson to Cardinals for Ramsey
  • Cabrera to Nationals for Walters

• Throwback jerseys for KC Royals game

• Mariners game attendees 7/31

1.1K Tweets
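
The "tokenize into two word groups" step on this slide is bigram extraction. A minimal sketch in Python follows; the two input strings are hypothetical cleaned tweets, not data from the talk.

```python
from collections import Counter

def bigrams(text):
    """Return adjacent word pairs from an already-cleaned string."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Hypothetical cleaned tweets for illustration.
tweets = [
    "indians trade masterson to cardinals",
    "masterson to cardinals for ramsey",
]

counts = Counter(bg for tweet in tweets for bg in bigrams(tweet))
print(counts.most_common(3))
```

Counting pairs instead of single words keeps phrases such as a player and destination team together, which is what makes trade chatter stand out.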

Page 24: Quantifying Fan Engagement using Social Media

DIFFERENCES OF WORD CLOUDS

SIMPLE WORD CLOUD, COMPARISON CLOUD, COMMONALITY CLOUD AND POLARIZED CLOUD

[Diagrams of text1 and text2 illustrating three cloud types: Simple Word Cloud; Commonality & Polarized Cloud; Comparison Cloud]
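
The difference between these clouds comes down to how term counts from two corpora are combined before plotting. The sketch below is one plausible reading, not the definition used by any particular word cloud package: commonality keeps terms present in both corpora, and comparison scores each term by the difference in relative frequency. The corpora are hypothetical.

```python
from collections import Counter

def term_freq(docs):
    """Total term counts across a list of cleaned documents."""
    return Counter(tok for d in docs for tok in d.split())

def commonality(freq_a, freq_b):
    """Terms used in both corpora, weighted by the smaller of the two counts."""
    return {t: min(freq_a[t], freq_b[t]) for t in freq_a.keys() & freq_b.keys()}

def comparison(freq_a, freq_b):
    """Relative-frequency difference; positive leans to corpus A, negative to B."""
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    terms = freq_a.keys() | freq_b.keys()
    return {t: freq_a[t] / total_a - freq_b[t] / total_b for t in terms}

# Hypothetical cleaned posts for two teams.
dodgers = term_freq(["trade deadline kershaw", "kershaw kimmel"])
yankees = term_freq(["trade deadline jeter"])

print(commonality(dodgers, yankees))
print(sorted(comparison(dodgers, yankees).items(), key=lambda kv: kv[1]))
```

A simple word cloud would plot `term_freq` for a single corpus directly, with word size proportional to count.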

Page 25: Quantifying Fan Engagement using Social Media

12K Tweets

• Includes a mix of free API access and full fire hose paid API access over 48 distinct hours

• Sampling occurred August 1 and August 13

• Tweets mentioning “Dodgers” most often discussed

• Clayton Kershaw’s appearance on Jimmy Kimmel Live

• FCC Chairman’s letter to Time Warner CEO about the Dodgers’ TV channel

Page 26: Quantifying Fan Engagement using Social Media

2K Spanish Tweets

• Free Twitter API Spanish language search over 48 distinct hours

• Sampling occurred July 29 and August 12

• Spanish-language tweets mentioning “Dodgers” most often discussed

• The AP story of Dan Haren beating the Braves

• Vin Scully retiring was a smaller topic although present

Example: Dodgers vencen a Bravos con 2 jonrones de Kemp http://t.co/9U7xiIPOdo #noticias

Translation: Dodgers beat Braves with 2 homers by Kemp http://t.co/9U7xiIPOdo #news

Page 27: Quantifying Fan Engagement using Social Media

235 Blogs

Treemap

Sentiment

• July 29-July 31
• Grouping is by Correlated Topic Modeling
• Color is sentiment
• Area is blog length
• Takeaways:
  • Babe Ruth’s birthday is shared with Laurence Fishburne, born in Augusta, Georgia – picked up blogs mentioning “birthdays on this date”
  • Eli Manning wants to remember advice of Derek Jeter
  • Pending trade deadline
  • ESPNNewYork writer Wallace Matthews
  • Game recaps

Page 28: Quantifying Fan Engagement using Social Media

Dissimilar Words

• Full FB Firehose of public posts

• Sampling occurred
  • Dodgers: July 29 – July 31
  • Yankees: July 28 – July 31

• FB mentions of Dodgers and Yankees tagged as English

• Marketing posts about the red New York Yankees World Series edition fitted cap that Spike Lee requested

Page 29: Quantifying Fan Engagement using Social Media

Words in Common

• Full FB Firehose of public posts

• Sampling occurred
  • Dodgers: July 29 – July 31
  • Yankees: July 28 – July 31

• FB mentions of Dodgers and Yankees tagged as English

• As expected, trades to improve the season were mentioned for both teams as the trade deadline approached

Page 30: Quantifying Fan Engagement using Social Media

COMPARATIVE ANALYSIS – BIGRAMS IN COMMON

• Full FB Firehose of public posts

• Sampling occurred
  • Dodgers: Jul 29 – Jul 31
  • Yankees: Jul 28 – Jul 31

• FB mentions of Dodgers and Yankees tagged as English

red sox

Equal Mentions

Page 31: Quantifying Fan Engagement using Social Media

FEELINGS TOWARDS THE TEAM

SIMPLE SENTIMENT ANALYSIS

Page 32: Quantifying Fan Engagement using Social Media

Natural language contains many words, but everyday usage declines steeply after the most common ones.

Word usage follows a predictable distribution: Zipf’s Law.

[Chart: x-axis word rank, 1 to 97; y-axis usage count, 0 to 900,000]

EXAMPLE POLARITY SCORING IN TWITTER

The top two words in spoken English are “the” and “be”. The top two words on Twitter are “RT” and “I”. However, the power-law distribution is similar and follows Zipf’s law.

Top 100 Word Usage from 3M Tweets
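
Zipf's law says a word's frequency is roughly inversely proportional to its rank, so rank times frequency should stay in the same ballpark down the list. The sketch below builds the rank-frequency table behind a chart like this one; the token sample is a hypothetical toy, not the 3M-tweet data.

```python
from collections import Counter

def rank_frequency(tokens, top_n=100):
    """Return (rank, word, count) for the top_n most frequent words."""
    counts = Counter(tokens).most_common(top_n)
    return [(rank, word, freq) for rank, (word, freq) in enumerate(counts, start=1)]

# Hypothetical toy sample of cleaned tokens.
tokens = "rt i rt the rt i rt a rt i the rt".split()

for rank, word, freq in rank_frequency(tokens):
    # Under Zipf's law, rank * freq stays roughly constant.
    print(rank, word, freq, rank * freq)
```

On real data the product drifts, but plotting count against rank on log-log axes gives a line that is close to straight.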

Page 33: Quantifying Fan Engagement using Social Media

Surprise is a sentiment. Hit by a bus! – Negative polarity but surprising.

Won the lottery! – Positive polarity but still surprising.

Use the University of Pittsburgh’s MPQA Lexicon & Illocution Inc’s 10K top Twitter words.

Keyword Scanning for polarity

SENTIMENT POLARITY ANALYSIS

R script scans for 3546 positive words, and 5701 negative words. It adds positive words and subtracts negative ones. The final score represents the polarity of the social interaction.

• I loathe the Tigers. -1
• I love Lou Whitaker. He was the best. +2
• I like the Tigers but dislike going to the stadium. 0
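
The keyword scan described on this slide can be reproduced in a few lines. The original is an R script with lexicons of 3546 positive and 5701 negative words; the Python sketch below uses tiny hypothetical lexicons that contain just enough words to score the three example sentences.

```python
import re

# Hypothetical mini-lexicons; the talk's script uses far larger word lists.
POSITIVE = {"love", "like", "best"}
NEGATIVE = {"loathe", "dislike"}

def polarity(text):
    """Add 1 for each positive word and subtract 1 for each negative word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)

examples = [
    "I loathe the Tigers.",
    "I love Lou Whitaker. He was the best.",
    "I like the Tigers but dislike going to the stadium.",
]
for sentence in examples:
    print(polarity(sentence), sentence)
```

The third sentence shows the method's main limitation: one positive and one negative word cancel out to 0 even though the text clearly carries sentiment.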

Page 34: Quantifying Fan Engagement using Social Media

DODGER SENTIMENT ON TWITTER 9/5

Median: -1
Mean: -0.47

Page 35: Quantifying Fan Engagement using Social Media

INDIANS SENTIMENT ON TWITTER 9/5

Median: 0
Mean: -0.1198

Page 36: Quantifying Fan Engagement using Social Media

YANKEE SENTIMENT ON TWITTER 9/5

Median: 0
Mean: -0.118

Page 37: Quantifying Fan Engagement using Social Media

IN COMPARISON…

dodgers rhp josh beckett won't return this season

hey..yankees....can ya score some runs?!

indians activate murphy from disabled list http://t.co/bqliintwsf

Team     Tweets >= 1  Tweets <= -1  Total w/o 0  % positive
Yankees  280          406           686          41%
Indians  290          456           746          39%
Dodgers  448          1,226         1,674        27%
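
The "% positive" column is the share of positive tweets among tweets with a non-zero score. A short sketch of that arithmetic, using the counts from the table (the function name is illustrative):

```python
# (positive tweets, negative tweets) per team, taken from the table above.
counts = {"Yankees": (280, 406), "Indians": (290, 456), "Dodgers": (448, 1226)}

def percent_positive(pos, neg):
    """Positive share of non-neutral tweets, rounded to a whole percent."""
    return round(100 * pos / (pos + neg))

for team, (pos, neg) in counts.items():
    print(team, pos + neg, f"{percent_positive(pos, neg)}%")
```

Neutral tweets (score 0) are excluded from the denominator, so the percentage describes only tweets that matched at least one net lexicon word.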

Page 38: Quantifying Fan Engagement using Social Media

WHAT IS THE POINT OF ALL THIS?

TARGETED MARKETING EFFORTS, EVANGELISTS, REFINED SEGMENTATION, MEDIA MIX MODELING LEADING TO ROI

Page 39: Quantifying Fan Engagement using Social Media

EXAMPLE IDENTIFY EVANGELISTS, INFLUENCERS & DETRACTORS

• When engaging on social media it is important to note the clout of followers in terms of status updates, and followers

• Running sentiment analysis on updates/posts adds context to the voice of the customer

• Appending other data allows for additional segmentation, and differentiated customer experiences e.g. my Yankee story

10K Indians Followers less 138 outliers

Page 40: Quantifying Fan Engagement using Social Media

MEDIA MIX MODELING FOR SOCIAL MEDIA ROI

• In lieu of actual merchandise sales data and marketing spend, Amazon Sales Rank was tracked hourly from 4/1 to 8/31

• Relative measure of sales against other “Sports and Outdoors” category items

• Lower number is better

Page 41: Quantifying Fan Engagement using Social Media

DODGER CAP AVERAGE HOURLY SALES RANK PER DAY

[Chart: x-axis date, 1-Apr to 29-Aug; y-axis average hourly sales rank, 0 to 4,500]

Amazon sales rank, when viewed as a time series, is not stationary. Overall, the Dodgers cap has an increasing trend (a worsening rank, since lower is better) despite the team being successful on the field, and it shows some periodicity based on day of week.

Page 42: Quantifying Fan Engagement using Social Media

Time Series Decomposition

• Time series decomposition (TSD), an econometric forecasting technique, was used in an attempt to isolate social media impact and understand sales rank patterns

• Trend is likely the impact of baseball season excitement, then waning as attention shifts to other sports

• Seasonal may be the impact of retail day of the week cycles

• Leaving random as the dependent variable in the media mix GLM
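
The decomposition into trend, seasonal and random parts can be illustrated with a simplified additive version: a centered moving average for trend, day-of-week averages for the seasonal part, and whatever is left as the random component. This is a pure-Python stand-in under those assumptions, not the routine used for the talk, and the daily series is synthetic.

```python
from statistics import mean

def decompose(series, period=7):
    """Additive split: series = trend + seasonal + remainder (odd period only)."""
    half = period // 2
    n = len(series)
    # Trend: centered moving average; undefined at the first and last `half` points.
    trend = [None] * n
    for i in range(half, n - half):
        trend[i] = mean(series[i - half:i + half + 1])
    # Seasonal: average detrended value for each position in the cycle, centered on zero.
    detrended = [(i, series[i] - trend[i]) for i in range(n) if trend[i] is not None]
    raw = [mean(v for i, v in detrended if i % period == k) for k in range(period)]
    offset = mean(raw)
    seasonal = [raw[i % period] - offset for i in range(n)]
    # Remainder: the "random" part that would feed the media mix model.
    remainder = [None if trend[i] is None else series[i] - trend[i] - seasonal[i]
                 for i in range(n)]
    return trend, seasonal, remainder

# Synthetic daily sales rank: rising trend plus a weekly pattern, no noise.
weekly = [0, 50, -50, 20, -20, 30, -30]
series = [1000 + 10 * day + weekly[day % 7] for day in range(28)]
trend, seasonal, remainder = decompose(series)
print(trend[3], seasonal[:7], remainder[3])
```

With at least two full cycles of data the function recovers the weekly pattern; on real sales rank data the remainder would not be zero, and that leftover is what gets compared with tweet volume.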

Page 43: Quantifying Fan Engagement using Social Media

Tweets to Decomposed Amazon Sales Rank

• Correlation is only -0.08.

• Given the tweets are examined against ‘random’ or unexplained data the relationship may still be relevant.

• As this is proxy data for sales of a single item, results not conclusive

[Chart: x-axis 0 to 100; y-axis -1,000 to 1,000]

*removed dates with missing data
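
A correlation like the one quoted on this slide can be computed after dropping dates where either series is missing. The sketch below implements the Pearson coefficient directly; whether the talk used Pearson or another measure is not stated, and both input lists are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation over pairs where neither value is missing (None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / sqrt(vx * vy)

# Hypothetical daily values; None marks a date with missing data.
tweets_per_day = [10, 20, None, 40, 30]
rank_remainder = [100, -50, 30, -200, None]

r = pearson(tweets_per_day, rank_remainder)
print(round(r, 2))
```

A negative value means more tweets coincide with a lower (better) sales rank, though with few observations the estimate is unstable.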

Page 44: Quantifying Fan Engagement using Social Media

Tweets to Average Daily Amazon Sales Rank

• Much stronger correlation: -0.24

• Leads one to believe that the more a team tweets, the lower (better) the sales rank

• As this is proxy data for sales of a single item, results not conclusive

*removed dates with missing data

[Chart: x-axis 0 to 100; y-axis 0 to 4,500]

Page 45: Quantifying Fan Engagement using Social Media

Media mix modeling

*removed dates with missing data

• Given the likely relationship:
  • Set up a GLM using marketing efforts and media spend as inputs, with the dependent variable being revenue, ticket sales, merchandise sales, etc.

• The coefficients of the inputs illustrate the impact of each channel’s marketing spend, leading you to ROI

Example:

$f(\mathrm{sales}) = \beta_0 + \beta_1(\mathrm{social.media.spend}) + \beta_2(\mathrm{traditional.mktg.spend}) + \beta_3(\mathrm{team.performance}) \dots \beta_n + \epsilon$

The goal is increased model lift and accuracy by incorporating social media spend. The coefficient of the variable demonstrates the impact. This will allow you to calculate an ROI of social spend.
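
To show how a coefficient leads to an ROI figure, here is a sketch of the simplest case of the model: a GLM with identity link, which is ordinary least squares. It is written in pure Python with hypothetical, noise-free data; the helper names are illustrative, and a real analysis would use a statistics package and report uncertainty around each coefficient.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Least-squares coefficients [b0, b1, ...] for y = b0 + b1*x1 + ... + error."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical weekly data: (social spend, traditional spend) and resulting sales.
spend = [(1, 5), (2, 3), (3, 8), (4, 2), (5, 7), (6, 4)]
sales = [100 + 3 * s + 2 * t for s, t in spend]  # toy data with known coefficients

b0, b_social, b_trad = fit_ols(spend, sales)
print(round(b0, 2), round(b_social, 2), round(b_trad, 2))
```

Here `b_social` is the estimated extra sales per extra unit of social spend with traditional spend held fixed, so subtracting 1 from it (when sales and spend are in the same currency) gives a rough ROI per unit spent.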

Page 46: Quantifying Fan Engagement using Social Media

Want example R scripts for the visuals?

www.sportsanalytics.org starting 9/15

FURTHER INFO